Abstract

In this paper, we propose a two-timescale delay-optimal dynamic clustering and power allocation
design for downlink network MIMO systems. The dynamic clustering control is adaptive to the global
queue state information (GQSI) only and computed at the base station controller (BSC) over a longer
time scale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both
intra-cluster channel state information (CCSI) and intra-cluster queue state information (CQSI), and
computed at the cluster manager (CM) over a shorter time scale. We show that the two-timescale
delay-optimal control can be formulated as an infinite-horizon average cost Constrained Partially Observed
Markov Decision Process (CPOMDP). By exploiting the special problem structure, we shall derive an 
equivalent Bellman equation in terms of Pattern Selection Q-factor to solve the CPOMDP. To address the 
distributive requirement and the issue of exponential memory requirement and computational complexity,
we approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose
a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM)
as well as the Lagrange multipliers (LM) (at each BS). We show that the proposed distributive online 
learning algorithm converges almost surely (with probability 1). By exploiting the birth-death structure 
of the queue dynamics, we further decompose the Per-cluster Potential function into sum of Per-cluster 
Per-user Potential functions and formulate the instantaneous power allocation as a Per-stage QSI-aware 
Interference Game played among all the CMs. We also propose a QSI-aware Simultaneous Iterative 
Water-filling Algorithm (QSIWFA) and show that it can achieve the Nash Equilibrium (NE). 
I. Introduction

The network MIMO/Cooperative MIMO system is proposed as one effective solution to address the 
inter-cell interference (ICI) bottleneck in multicell systems by exploiting data cooperation and joint 
processing among multiple base stations (BS). Channel state information (CSI) and user data exchange 
among BSs through the backhaul are required to support network MIMO and this overhead depends 
on the number of BSs involved in the cooperation and joint processing. In practice, it is not possible 
to support such full-scale cooperation and BSs are usually grouped into disjoint clusters with limited 
number of BSs in each cluster to reduce the processing complexity as well as the backhaul loading. The 
BSs within each cluster cooperatively serve the users associated with them, which lowers the system
complexity and completely eliminates the intra-cluster interference.

The clustering methods can be classified into two categories: static clustering approach and dynamic 
clustering approach. For static clustering, the clusters are pre-determined and do not change over time. 
For example, in [1], [2], the authors proposed BS coordination strategies for fixed clusters to eliminate
intra-cluster interference. For dynamic clustering, the cooperation clusters change over time. For example, in
[3], given global CSI (GCSI), a central unit jointly forms the clusters, selects the users and calculates the beamforming
coefficients and the power allocations to maximize the weighted sum rate by a brute-force exhaustive
search. In [4], the authors proposed a greedy dynamic clustering algorithm to improve the sum rate
under the assumption that CSI of the neighboring BSs is available at each BS. In general, compared with 
static clustering, the dynamic clustering approach usually has better performance due to larger optimizing 
domain, while it also leads to larger signaling overhead to obtain more CSI and higher computational 
complexity for intelligent clustering. 

However, all of these works assume that there are infinite backlogs of packets at the transmitter
and that the information flow is delay insensitive. The derived control policy (e.g. clustering and power
allocation policy) is only a function of CSI, explicitly or implicitly. In practice, many applications are
delay sensitive, and it is critical to optimize the delay performance of network MIMO systems. In
particular, we are interested in investigating delay-optimal clustering and power control in network MIMO
systems, which also adapts to the queue state information (QSI). This is motivated by an example in Fig.
1. The CSI-based clustering will always pick Pattern 1, creating a cooperation and interference profile
in favor of MS 2 and MS 4 regardless of the queue states of these mobiles. However, the QSI-based 
clustering will dynamically pick the clustering patterns according to the queue states of all the mobiles. 

The design framework taking into account both queueing delay and physical layer performance is not
trivial, as it involves queueing theory (to model the queueing dynamics) and information theory (to model
the physical layer dynamics). The simplest approach is to convert the delay constraints into an equivalent
average rate constraint using tail probability (large deviation theory) and solve the optimization problem
using a purely information theoretical formulation based on the rate constraint [?]. However, the control
policy derived is a function of the CSI only, and it fails to exploit the QSI in the adaptation process.
The Lyapunov drift approach is also widely used in the literature to study the queue stability region of
different wireless systems and establish throughput-optimal control policies (in the stability sense). However,
the average delay bound derived in terms of the Lyapunov drift is tight only for heavy traffic loading
[?]. A systematic approach to delay-optimal resource control in the general delay regime is via the
Markov Decision Process (MDP) technique [?]. However, there are various technical challenges involved
regarding dynamic clustering and power allocation for delay-optimal network MIMO systems.

• The Curse of Dimensionality: Although the MDP technique is the systematic approach to solve the
delay-optimal control problem, a first-order challenge is the curse of dimensionality [5]. For example,
a huge state space (exponential in the total number of users in the network) will be involved in the
MDP and brute-force value or policy iterations cannot lead to any implementable solutions [6].¹

• Signaling Overhead and Computational Complexity for Dynamic Clustering: Optimal dynamic
clustering in [3] and greedy dynamic clustering in [4] (both in throughput sense) require GCSI or CSI
of all neighboring BSs, which leads to heavy signaling overhead on backhaul and high computational
complexity for the central controller. For delay-optimal network MIMO control, the entire system
state is characterized by the GCSI and the global QSI (GQSI). Therefore, the centralized solution
(which requires GCSI and GQSI) will induce substantial signaling overhead between the BSs and
the base station controller (BSC).

• Issues of Convergence in Stochastic Optimization Problem: In conventional iterative solutions for
deterministic network utility maximization (NUM) problems, the updates in the iterative algorithms
(such as subgradient search) are performed within the coherence time of the CSI (the CSI remains
quasi-static during the iteration updates).² When we consider the delay-optimal problem, the
problem is stochastic and the control actions are defined over the ergodic realizations of the system
states (CSI, QSI). Therefore, the convergence proof is also quite challenging.

¹For a multi-cell system with 7 BSs, 2 users served by each BS, a buffer size of 10 per user and 50 CSI states for each link
between one user and one BS, the system state space contains $(10 + 1)^{2 \times 7} \times 50^{2 \times 7 \times 7}$ states, which is already unmanageable.

²This poses a serious limitation on the practicality of the distributive iterative solutions because the convergence and the
optimality of the iterative solutions are not guaranteed if the CSI changes significantly during the update.
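The size of the state space in the 7-BS example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and its parameters are hypothetical, and it assumes one queue per user and one CSI variable per (user, BS) link, as the example describes.

```python
def state_space_size(num_bs, users_per_bs, buffer_size, csi_states_per_link):
    """Return (queue-state count, CSI-state count) of the joint system state."""
    num_users = num_bs * users_per_bs
    num_links = num_users * num_bs  # assumption: one link per (user, BS) pair
    queue_states = (buffer_size + 1) ** num_users  # each queue holds 0..buffer_size
    csi_states = csi_states_per_link ** num_links
    return queue_states, csi_states


queue_states, csi_states = state_space_size(7, 2, 10, 50)
total_states = queue_states * csi_states
# Report the order of magnitude rather than the full integer.
print("decimal digits in the state count:", len(str(total_states)))
```

Even for this small network the count has well over a hundred decimal digits, which is why tabular value or policy iteration is not an option here.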

In this paper, we consider a two-timescale delay-optimal dynamic clustering and power allocation 
for the downlink network MIMO consisting of B cells with one BS and K MSs in each cell. For 
implementation consideration, the dynamic clustering control is adaptive to the GQSI only and computed 
at the BSC over a longer time scale. On the other hand, the power allocations of all the BSs in one 
cluster are adaptive to both CCSI and intra-cluster QSI (CQSI), and computed at the CM over a shorter 
time scale. Due to the two-timescale control structure, the delay-optimal control is formulated as an
infinite-horizon average cost Constrained Partially Observed Markov Decision Process (CPOMDP). We
propose an equivalent Bellman equation in terms of Pattern Selection Q-factor to solve the CPOMDP. We
approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a 
novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as 
well as the Lagrange multipliers (LM) (at each BS). This update algorithm requires CCSI and CQSI only 
and therefore, facilitates distributive implementations. Using separation of time scales, we shall establish 
the almost-sure convergence proof of the proposed distributive online learning algorithm. By exploiting 
the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function 
into sum of Per-cluster Per-user Potential functions. Based on these distributive potential functions 
and birth-death structure, the instantaneous power allocation control is formulated as a Per-stage QSI- 
aware Interference Game and determined by a QSI-aware Simultaneous Iterative Water-filling Algorithm 
(QSIWFA). We show that QSIWFA can achieve the NE of the QSI-aware interference game. Unlike
conventional iterative water-filling solutions [11], the water level of our solution is adaptive to the QSI
via the potential functions.

We first list the acronyms used in this paper in Table I.

TABLE I
List of Acronyms

BSC              base station controller
ICI              inter-cell interference
CM               cluster manager
LM               Lagrange multiplier
L/C/G CSI (QSI)  local/intra-cluster/global channel state information (queue state information)
CPOMDP           constrained partially observed Markov decision process
QSIWFA           QSI-aware simultaneous iterative water-filling algorithm
II. System Models

In this section, we shall elaborate the network MIMO system topology, the physical layer model, the 
bursty source model and the control policy. 

A. System Topology 

We consider a wireless cellular network consisting of B cells with one BS and K MSs in each cell
as illustrated in Fig. 2. We assume each BS is equipped with $N_t \ge K$ transmit antennas and each MS
has 1 receive antenna.³ Denote the set of B BSs as $\mathcal{B} = \{1, \dots, B\}$ and the set of K MSs in each cell
as $\mathcal{K} = \{1, \dots, K\}$, respectively. We consider a clustered network MIMO system with maximum cluster
size $N_B$. Let $\omega_n \subseteq \mathcal{B}$ denote a feasible cluster n, which is a collection of $|\omega_n|$ neighboring BSs. We
define a clustering pattern $C \in \mathcal{C}$ to be a partition of $\mathcal{B}$ as follows

$$C = \{\omega_n \subseteq \mathcal{B} : \omega_n \cap \omega_{n'} = \emptyset \ \forall n \ne n', \ \cup_{\omega_n \in C}\, \omega_n = \mathcal{B}\} \qquad (1)$$

where $\mathcal{C}$ is the collection of all clustering patterns, with cardinality $I_C$.
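To illustrate how the collection of clustering patterns grows, the following sketch enumerates every partition of a BS set into clusters of bounded size. It is a simplification under stated assumptions: the function name is hypothetical, and the requirement that the BSs in a cluster be neighbors is ignored, so it gives an upper bound on the number of feasible patterns.

```python
from itertools import combinations


def clustering_patterns(base_stations, max_cluster_size):
    """Yield every partition of base_stations into clusters of bounded size."""
    stations = list(base_stations)
    if not stations:
        yield []  # the empty set has exactly one (empty) partition
        return
    first, rest = stations[0], stations[1:]
    # Pick the BSs that share a cluster with `first`, then partition the rest.
    max_extra = min(max_cluster_size - 1, len(rest))
    for extra in range(max_extra + 1):
        for companions in combinations(rest, extra):
            cluster = frozenset((first,) + companions)
            remaining = [b for b in rest if b not in cluster]
            for tail in clustering_patterns(remaining, max_cluster_size):
                yield [cluster] + tail


patterns = list(clustering_patterns([1, 2, 3], max_cluster_size=2))
for pattern in patterns:
    print([sorted(cluster) for cluster in pattern])
```

Fixing the cluster of the first BS before recursing guarantees that each partition is produced exactly once.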

As illustrated in Fig. 2, the overall multicell network is specified by a three-layer hierarchical architecture,
i.e. the base station controller (BSC), the cluster managers (CM) and the BSs. There are K user queues 
at each BS, which buffer packets for the K MSs in each cell. Both the local CSI (LCSI) and local 
QSI (LQSI) are measured locally at each BS. The BSC obtains the global QSI (GQSI) from the LQSI 
distributed at each BS, determines the clustering pattern according to the GQSI, and informs the CMs of 
the concerned clusters with their intra-cluster QSI (CQSI). During each scheduling slot, the CM of each 
cluster determines the precoding vectors as well as the transmit power of the BSs in the cluster. 

B. Physical Layer Model 

Denote MS k in cell b as a BS-MS index pair (b, k). The channel from the transmit antennas in BS
b' to the MS (b, k) is denoted as the $1 \times N_t$ vector $\mathbf{h}_{(b,k),b'}$ ($\forall b, b' \in \mathcal{B}, k \in \mathcal{K}$), with its i-th element
($1 \le i \le N_t$) $h_{(b,k),b'}(i) \in \mathcal{H}$ a discrete random variable distributed according to a general distribution
$P_{h_{(b,k),b'}}(h)$ with mean 0 and variance $\sigma_{(b,k),b'}$, where $\mathcal{H}$ denotes the per-user discrete CSI state space with
cardinality $N_H$ and $\sigma_{(b,k),b'}$ denotes the path gain between BS b' and MS (b, k). For a given clustering
pattern C, let $\mathbf{H}_{b,n} = \{\mathbf{h}_{(b,k),b'} : b' \in \omega_n, k \in \mathcal{K}\}$ ($\forall \omega_n \in C, b \in \omega_n$), $\mathbf{H}_n = \cup_{b \in \omega_n} \mathbf{H}_{b,n}$ ($\forall \omega_n \in C$)
and $\mathbf{H} = \cup_{\omega_n \in C} \mathbf{H}_n \in \boldsymbol{\mathcal{H}}$ denote the LCSI at BS b in cluster n, the CCSI at the CM n, and the GCSI,
respectively, where $\boldsymbol{\mathcal{H}}$ denotes the GCSI state space. In this paper, the time dimension is partitioned into
scheduling slots indexed by t with slot duration $\tau$.

³When $N_t < K$, there will be a user selection control to select at most $N_t$ active users from the K users and the proposed
solution framework could be extended easily to accommodate this user selection control as well.

Assumption 1: The GCSI $\mathbf{H}(t) \in \boldsymbol{\mathcal{H}}$ is quasi-static in each scheduling slot and i.i.d. over scheduling
slots. Furthermore, $h_{(b,k),b'}(t) \in \mathcal{H}$ is independent w.r.t. $\{(b,k), b'\}$ and t. The path gain $\sigma_{(b,k),b'}$ remains
constant for the duration of the communication session. ■

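Let $s_{b,k}$ and $p_{b,k}$ ($\forall b \in \mathcal{B}, k \in \mathcal{K}$) denote the information symbols and the received power of MS
(b, k), respectively. Denote $\mathbf{w}_{(b,k),b'}$ ($\forall b, b' \in \omega_n$) as the $N_t \times 1$ precoding vector for MS (b, k) at the BS
b'. Therefore, the received signal of MS (b, k) in cluster n ($\omega_n \in C$) is given by

$$
\begin{aligned}
y_{b,k} ={}& \underbrace{\Big(\sum_{b' \in \omega_n} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b,k),b'}\Big) \sqrt{p_{b,k}}\, s_{b,k}}_{\text{desired signal}}
+ \underbrace{\sum_{\substack{b'' \in \omega_n,\, k'' \in \mathcal{K} \\ (b'',k'') \ne (b,k)}} \Big(\sum_{b' \in \omega_n} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}\Big) \sqrt{p_{b'',k''}}\, s_{b'',k''}}_{\text{intra-cluster interference}} \\
&+ \underbrace{\sum_{\substack{\omega_{n'} \in C \\ n' \ne n}} \sum_{\substack{b'' \in \omega_{n'} \\ k'' \in \mathcal{K}}} \Big(\sum_{b' \in \omega_{n'}} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}\Big) \sqrt{p_{b'',k''}}\, s_{b'',k''}}_{\text{inter-cluster interference}} + z_{b,k}, \quad \forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C
\end{aligned}
$$

where $z_{b,k} \sim \mathcal{CN}(0, 1)$ is noise. Based on CCSI at the CM, we adopt zero-forcing (ZF) within each
cluster to eliminate the intra-cluster interference⁴ [?]. The ZF precoder of cluster n ($\omega_n \in C$),
$\{\mathbf{w}_{(b,k),b'} : b, b' \in \omega_n, k \in \mathcal{K}\}$, satisfies $\sum_{b' \in \omega_n} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b,k),b'} = 1$ ($\forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C$) and
$\sum_{b' \in \omega_n} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'} = 0$ ($\forall b, b'' \in \omega_n$, $k, k'' \in \mathcal{K}$, $(b'',k'') \ne (b,k)$). The transmit power of BS b
is therefore given by

$$P_b = \sum_{b' \in \omega_n} \sum_{k \in \mathcal{K}} \|\mathbf{w}_{(b',k),b}\|^2\, p_{b',k}, \quad \forall b \in \mathcal{B} \qquad (2)$$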

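For simplicity, we assume perfect CSI at the transmitter and receiver, and the maximum achievable
data rate (bit/s/Hz) of MS (b, k) in cluster $\omega_n$ is given by the mutual information between the channel
inputs $s_{b,k}$ and channel outputs $y_{b,k}$ as:

$$R_{b,k} = \log(1 + \mathrm{SINR}_{b,k}) = \log\Big(1 + \frac{p_{b,k}}{1 + I_{b,k}}\Big), \quad \forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C \qquad (3)$$

where $I_{b,k} = \sum_{\omega_{n'} \in C,\, n' \ne n} \sum_{b'' \in \omega_{n'},\, k'' \in \mathcal{K}} \big|\sum_{b' \in \omega_{n'}} \mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}\big|^2 p_{b'',k''}$ ($\forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C$) is the
inter-cell interference power.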
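As a small numerical illustration of the rate expression, the sketch below evaluates the rate for a given received power and interference power. It rests on assumptions that should be read as such: unit noise power, unit desired-signal gain after ZF precoding, and a base-2 logarithm (suggested by the bit/s/Hz unit). The function name is hypothetical.

```python
import math


def achievable_rate(received_power, interference_power, noise_power=1.0):
    """Rate in bit/s/Hz for one MS, treating interference as noise."""
    sinr = received_power / (noise_power + interference_power)
    return math.log2(1.0 + sinr)  # assumption: logarithm taken to base 2


# The same received power yields a lower rate once interference appears.
print(achievable_rate(3.0, 0.0))
print(achievable_rate(3.0, 1.0))
```

Because the interference term depends on the powers chosen in the other clusters, the rates of different clusters are coupled even though ZF removes the interference inside each cluster.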

⁴We consider ZF precoding as an example, but the solution framework in the paper can be applied to other SDMA processing
techniques as well. Our zero-forcing precoder design can also be extended for multi-antenna MS with block zero-forcing similar
to that in [?].

C. Bursty Source Model

Let $\mathbf{A}(t) = \{A_{b,k}(t) : b \in \mathcal{B}, k \in \mathcal{K}\}$ be the random new arrivals (number of bits) for the BK users
in the multicell network at the end of the t-th scheduling slot.

Assumption 2: The arrival process $A_{b,k}(t)$ is distributed according to general distributions $P_{A_{b,k}}(A)$
and is i.i.d. over scheduling slots and independent w.r.t. $\{(b,k)\}$. ■

Let $\mathbf{Q}(t) \in \boldsymbol{\mathcal{Q}}$ be the $B \times K$ GQSI matrix of the multicell network, where $Q_{b,k}(t) \in \mathcal{Q}$ is the (b, k)-
element of $\mathbf{Q}(t)$, which denotes the number of bits in the queue for MS (b, k) at the beginning of the
t-th slot. The per-user QSI state space and the GQSI state space are given by $\mathcal{Q} = \{0, 1, \dots, N_Q\}$ and
$\boldsymbol{\mathcal{Q}} = \mathcal{Q}^{BK}$, respectively. $N_Q$ denotes the buffer size (maximum number of bits) of the queues for the BK
MSs. Thus, the cardinality of the GQSI state space is $I_Q = (N_Q + 1)^{BK}$, which grows exponentially
with BK. Let $\mathbf{R}(t)$ be the $B \times K$ scheduled data rates matrix of the BK MSs, where the (b, k)-element
$R_{b,k}(t)$ can be calculated using (3). We assume the controller is causal so that new arrivals $\mathbf{A}(t)$ are
observed only after the controller's actions at the t-th slot. Hence, the queue dynamics is given by the
following equation:

$$Q_{b,k}(t+1) = \min\big\{ [Q_{b,k}(t) - R_{b,k}(t)\tau]^+ + A_{b,k}(t),\ N_Q \big\} \qquad (4)$$

where $[x]^+ = \max\{x, 0\}$ and $\tau$ is the duration of a scheduling slot. For notational convenience, we denote
$\chi(t) = (\mathbf{H}(t), \mathbf{Q}(t))$ as the global system state at the t-th slot.
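The one-slot queue recursion can be sketched as follows. This is a minimal illustration under the assumption, suggested by the causality statement above, that bits are served first and new arrivals then join the queue, with overflow beyond the buffer size dropped. The function and parameter names are hypothetical.

```python
def next_queue_length(queue_bits, rate, slot_duration, arrivals, buffer_size):
    """Advance one user's queue by a single scheduling slot."""
    # Serve first: the queue cannot go below empty.
    after_service = max(queue_bits - rate * slot_duration, 0)
    # Then admit the new arrivals, truncated at the buffer size.
    return min(after_service + arrivals, buffer_size)


# A short trajectory for one user with a buffer of 10 bits.
queue = 5
for rate, arrivals in [(2, 3), (8, 1), (0, 9)]:
    queue = next_queue_length(queue, rate, 1, arrivals, 10)
    print(queue)
```

The two truncations (at zero and at the buffer size) are what make the per-user queue a finite-state chain, which the later birth-death argument relies on.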

D. Clustering Pattern Selection and Power Control Policy 

At the beginning of the t-th slot, given the observed GQSI realization $\mathbf{Q}(t)$, the BSC determines the
clustering pattern C defined in (1), and the CMs of the active clusters n ($\forall \omega_n \in C$) perform power allocation
based on GCSI and GQSI according to a pattern selection and power allocation policy defined below.

Definition 1 (Stationary Pattern Selection and Power Allocation Policy): A stationary pattern selection
and power allocation policy $\Omega = (\Omega_c, \Omega_p)$ is a mapping from the system state $\chi$ to the pattern
selection and power allocation actions, where $\Omega_c(\mathbf{Q}) = C \in \mathcal{C}$ and $\Omega_p(\chi) = \{p_{b,k} : b \in \mathcal{B}, k \in \mathcal{K}\}$. A
policy $\Omega$ is called feasible if the associated actions satisfy the per-BS average transmit power constraint
given by

$$\mathbb{E}^{\Omega}[P_b] \le \overline{P}_b, \quad \forall b \in \mathcal{B} \qquad (5)$$

where $P_b$ is given by (2) and $\overline{P}_b$ is the average total power of BS b. ■
Remark 1 (Two Time-Scale Control Policy): The pattern selection policy is defined as a function of
GQSI only, i.e. $\Omega_c(\mathbf{Q})$, for the following reasons. The QSI is changing on a slower time scale while
the CSI is changing on a faster (slot-by-slot) time scale. The dynamic clustering is enforced at the
BSC and hence, a longer time scale will be desirable from the implementation perspective, considering
computational complexity at the BSC and signaling overhead for collecting GCSI from all the BSs. On
the other hand, the low complexity and decentralized power allocation policy (obtained later in Sec. IV)
is a function of CQSI and CCSI only and executed at the CM level distributively,⁵ and hence it can
operate at slot-time scale with acceptable signaling overhead and complexity. ■

III. Problem Formulation 

In this section, we shall first elaborate the dynamics of the system state under a control policy $\Omega$.
Based on that, we shall then formally formulate the delay-optimal control problem for network MIMO 
systems. 

A. Dynamics of System State 

A stationary control policy $\Omega$ induces a joint distribution for the random process $\{\chi(t)\}$. Under
Assumptions 1 and 2, the arrival and departure are memoryless. Therefore, the induced random process
$\{\chi(t)\}$ for a given control policy $\Omega$ is Markovian with the following transition probability:

$$
\begin{aligned}
\Pr[\chi(t+1) \,|\, \chi(t), \Omega(\chi(t))] &= \Pr[\mathbf{H}(t+1) \,|\, \chi(t), \Omega(\chi(t))]\, \Pr[\mathbf{Q}(t+1) \,|\, \chi(t), \Omega(\chi(t))] \\
&= \Pr[\mathbf{H}(t+1)]\, \Pr[\mathbf{Q}(t+1) \,|\, \chi(t), \Omega(\chi(t))] \qquad (6)
\end{aligned}
$$

Note that the BK queues are coupled together via the control policy $\Omega$.

B. Delay Optimal Problem Formulation 

Given a unichain policy $\Omega$, the induced Markov chain $\{\chi(t)\}$ is ergodic⁶ and there exists a unique
steady state distribution $\pi_\chi$, where $\pi_\chi(\chi) = \lim_{t \to \infty} \Pr[\chi(t) = \chi]$. The average cost of MS (b, k) under
a unichain policy $\Omega$ is given by:

$$\overline{D}_{b,k}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}[f(Q_{b,k}(t))] = \mathbb{E}_{\pi_\chi}[f(Q_{b,k})], \quad \forall b \in \mathcal{B}, k \in \mathcal{K} \qquad (7)$$

where $f(Q_{b,k})$ is a monotonic increasing utility function of $Q_{b,k}$ and $\mathbb{E}_{\pi_\chi}$ denotes expectation w.r.t.
the underlying measure $\pi_\chi$. For example, when $f(Q_{b,k}) = Q_{b,k}/\lambda_{b,k}$, with $\lambda_{b,k}$ the average arrival rate,
$\overline{D}_{b,k}(\Omega) = \mathbb{E}_{\pi_\chi}[Q_{b,k}]/\lambda_{b,k}$ is the average delay
of MS (b, k) (by Little's Law). Another interesting example is the queue outage probability $\overline{D}_{b,k}(\Omega) =
\Pr[Q_{b,k} \ge Q_{b,k}^o]$, in which $f(Q_{b,k}) = \mathbf{1}[Q_{b,k} \ge Q_{b,k}^o]$, where $Q_{b,k}^o \in \mathcal{Q}$ is the reference outage queue
state. Similarly, the average transmit power constraint in (5) can be written as

$$\overline{P}_b(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}[P_b(t)] = \mathbb{E}_{\pi_\chi}[P_b] \le \overline{P}_b, \quad \forall b \in \mathcal{B} \qquad (8)$$

where $P_b$ is given by (2).

⁵According to Definition 1, the power control policy $\Omega_p$ is defined as a function of the GQSI and GCSI. Yet, in Sec. IV, we
shall derive a decentralized power allocation policy, which is adaptive to CCSI and CQSI only.

⁶The unichain policy is defined as a policy under which the resulting Markov chain is ergodic [8]. Similar to other literature
in MDP [?], [13], we restrict our consideration to unichain policies in this paper. Such an assumption usually does not contribute any
loss of performance. For example, in Section V, any non-degenerate control policy satisfies $\mathbb{E}[p_{b,k}(\mathbf{H}, \mathbf{Q}) \,|\, \mathbf{Q}] > 0$, $\forall Q_{b,k} > 0$,
i.e. $\mu(Q_{b,k}) > 0$, $\forall Q_{b,k} > 0$. Hence, the induced Markov chain $\{\mathbf{Q}(t)\}$ is an ergodic birth-death process.

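In this paper, we seek to find an optimal stationary unichain control policy to minimize the average
cost in (7). Specifically, we have

Problem 1 (Delay-Optimal Control Problem for Network MIMO): For some positive constants⁷ $\beta =
\{\beta_{b,k} > 0 : b \in \mathcal{B}, k \in \mathcal{K}\}$, the delay-optimal problem is formulated as

$$
\min_{\Omega}\ J_\beta(\Omega) = \sum_{b,k} \beta_{b,k} \overline{D}_{b,k}(\Omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\big[d(\chi(t), \Omega(\chi(t)))\big],
\quad \text{subject to the power constraints in (8)} \qquad (9)
$$

where $d(\chi(t), \Omega(\chi(t))) = \sum_{b,k} \beta_{b,k} f(Q_{b,k}(t))$. ■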

C. Constrained POMDP 

Next, we shall illustrate that Problem 1 is an infinite horizon average cost constrained POMDP. In
Problem 1, the pattern selection policy is defined on the partial system state $\mathbf{Q}$, while the power allocation
policy is defined on the complete system state $\chi = (\mathbf{H}, \mathbf{Q}) \in \mathcal{X}$, where $\mathcal{X} = \boldsymbol{\mathcal{H}} \times \boldsymbol{\mathcal{Q}}$. Therefore, Problem
1 is a constrained POMDP (CPOMDP) with the following specification:

• State Space: The state space is given by $\{(\mathbf{H}, \mathbf{Q}) : \forall \mathbf{H} \in \boldsymbol{\mathcal{H}}, \mathbf{Q} \in \boldsymbol{\mathcal{Q}}\}$, where $(\mathbf{H}, \mathbf{Q})$ is a realization
of the global system state.

• Action Space: The action space is given by $\{\Omega(\mathbf{H}, \mathbf{Q}) : \forall \mathbf{H} \in \boldsymbol{\mathcal{H}}, \mathbf{Q} \in \boldsymbol{\mathcal{Q}}\}$, where $\Omega = (\Omega_c, \Omega_p)$ is
a unichain feasible policy as defined in Definition 1.

• Transition Kernel: The transition kernel $\Pr[\chi' \,|\, \chi, \Omega(\chi)]$ is given by (6).

• Per-stage Cost Function: The per-stage cost function is given by $d(\chi, \Omega(\chi)) = \sum_{b,k} \beta_{b,k} f(Q_{b,k})$.

• Observation: The observation for the pattern selection policy is GQSI, i.e., $o_c = \mathbf{Q}$, while the
observation for the power allocation policy is the complete system state, i.e. $o_p = \chi$.

• Observation Function: The observation function for the pattern selection policy is
$O_c(o_c, \chi, (\Omega_c(\mathbf{Q}'), \Omega_p(\chi'))) = 1$ if $o_c = \mathbf{Q}$, otherwise 0. Similarly, the observation function for the
power allocation policy is $O_p(o_p, \chi, (\Omega_c(\mathbf{Q}'), \Omega_p(\chi'))) = 1$ if $o_p = \chi$, otherwise 0.

⁷The positive weighting factors $\beta$ in (9) indicate the relative importance of buffer delay among the BK data streams and
for each given $\beta$, the solution to (9) corresponds to a point on the Pareto optimal delay tradeoff boundary of a multi-objective
optimization problem.
For any Lagrange multiplier (LM) vector $\gamma = (\gamma_1, \dots, \gamma_B)$ ($\forall \gamma_b \ge 0$), define the Lagrangian as

$$L_\beta(\Omega, \gamma) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\big[g(\gamma, \chi(t), \Omega(\chi(t)))\big]$$

where $g(\gamma, \chi, \Omega(\chi)) = \sum_{b \in \mathcal{B}} \big( \sum_{k \in \mathcal{K}} \beta_{b,k} f(Q_{b,k}) + \gamma_b (P_b - \overline{P}_b) \big)$. Therefore, the corresponding
unconstrained POMDP for a particular LM vector $\gamma$ (i.e. the Lagrange dual function) is given by

$$G(\gamma) = \min_{\Omega} L_\beta(\Omega, \gamma) \qquad (10)$$

The dual problem of the primal problem in Problem 1 is given by $\max_{\gamma \succeq 0} G(\gamma)$. It is shown in [15]
that there exists a Lagrange multiplier $\gamma^* \ge 0$ such that $\Omega^*$ minimizes $L_\beta(\Omega, \gamma^*)$ and the saddle point
condition $L_\beta(\Omega^*, \gamma) \le L_\beta(\Omega^*, \gamma^*) \le L_\beta(\Omega, \gamma^*)$ holds, i.e. $(\Omega^*, \gamma^*)$ is a saddle point of the Lagrangian
$L_\beta(\Omega, \gamma)$. Using standard optimization theory [10], $\Omega^*$ is the primal optimal (i.e. the optimal solution
of the original Problem 1), $\gamma^*$ is the dual optimal (i.e. the optimal solution of the dual problem), and
the duality gap (i.e. the gap between the primal objective at $\Omega^*$ and the dual objective at $\gamma^*$) is zero.
Therefore, by solving the dual problem, we can obtain the primal optimal $\Omega^*$.
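The per-stage Lagrangian cost combines the weighted queue cost with a priced power excess at each BS. The sketch below evaluates it for one slot; the function name, the nested-list data layout and the default choice f(Q) = Q are assumptions made for illustration only.

```python
def lagrangian_stage_cost(beta, queues, power, power_budget, gamma, f=lambda q: q):
    """Weighted queue cost plus multiplier-priced power excess, summed over BSs.

    beta[b][k], queues[b][k]: weight and queue length of user k at BS b.
    power[b], power_budget[b], gamma[b]: transmit power, budget and LM of BS b.
    """
    total = 0.0
    for b in range(len(queues)):
        delay_term = sum(beta[b][k] * f(queues[b][k]) for k in range(len(queues[b])))
        # Positive when BS b exceeds its budget, negative when it is below it.
        power_term = gamma[b] * (power[b] - power_budget[b])
        total += delay_term + power_term
    return total


cost = lagrangian_stage_cost(
    beta=[[1.0, 1.0], [2.0, 1.0]],
    queues=[[3, 0], [1, 4]],
    power=[5.0, 2.0],
    power_budget=[4.0, 4.0],
    gamma=[0.5, 1.0],
)
print(cost)
```

A larger multiplier makes exceeding the budget more expensive, which is how the dual variables steer the unconstrained minimizer toward feasibility.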

D. Equivalent Bellman Equation 

While POMDP is a very difficult problem in general, we shall exploit some special structures in our 
problem to substantially simplify the problem. We first define conditional power allocation action sets 
below: 

Definition 2 (Conditional Power Allocation Action Sets): Given a power allocation policy $\Omega_p$, we
define a conditional power allocation set $\Omega_p(\mathbf{Q}) = \{\mathbf{p} = \Omega_p(\chi) : \chi = (\mathbf{Q}, \mathbf{H}), \forall \mathbf{H}\}$ as the collection of
actions for all possible CSI $\mathbf{H}$ conditioned on a given QSI $\mathbf{Q}$. The complete control policy $\Omega_p$ is therefore
equal to the union of all the conditional power allocation action sets, i.e. $\Omega_p = \cup_{\mathbf{Q}}\, \Omega_p(\mathbf{Q})$. ■

Based on Definition 2, we can transform the POMDP problem into a regular infinite-horizon average
cost MDP. Furthermore, for a given $\gamma$, the optimal control policy $\Omega^*$ can be obtained by solving an
equivalent Bellman equation which is summarized in the lemma below.
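Lemma 1 (Equivalent Bellman Equation and Pattern Selection Q-factor): For a given LM $\gamma$, the
optimal control policy $\Omega^* = (\Omega_c^*, \Omega_p^*)$ for the unconstrained optimization problem in Problem 1 can be
obtained by solving the following equivalent Bellman equation: ($\forall \mathbf{Q} \in \boldsymbol{\mathcal{Q}}, \forall C \in \mathcal{C}$)

$$\mathbb{Q}(\mathbf{Q}, C) = \min_{\Omega_p(\mathbf{Q})} \Big\{ g(\gamma, \mathbf{Q}, C, \Omega_p(\mathbf{Q})) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' \,|\, \mathbf{Q}, C, \Omega_p(\mathbf{Q})] \min_{C'} \mathbb{Q}(\mathbf{Q}', C') \Big\} - \theta \qquad (11)$$

where $\{\mathbb{Q}(\mathbf{Q}, C)\}$ is the Pattern Selection Q-factor, $g(\gamma, \mathbf{Q}, C, \Omega_p(\mathbf{Q})) = \mathbb{E}[g(\gamma, (\mathbf{H}, \mathbf{Q}), C, \Omega_p(\mathbf{H}, \mathbf{Q})) \,|\, \mathbf{Q}]$
is the conditional per-stage cost, and $\Pr[\mathbf{Q}' \,|\, \mathbf{Q}, C, \Omega_p(\mathbf{Q})] = \mathbb{E}[\Pr[\mathbf{Q}' \,|\, (\mathbf{H}, \mathbf{Q}), C, \Omega_p(\mathbf{H}, \mathbf{Q})] \,|\, \mathbf{Q}]$ is the
conditional expectation of the transition kernel. If there is a $(\theta, \{\mathbb{Q}(\mathbf{Q}, C)\})$ that satisfies the fixed-point
equations in (11), then $\theta = L_\beta^*(\gamma) = \min_{\Omega} L_\beta(\Omega, \gamma)$ is the optimal average cost in Problem 1. Furthermore,
the optimal control policy is given by $\Omega^*(\gamma) = (\Omega_c^*, \Omega_p^*)$ with $\Omega_p^*(\mathbf{Q})$ attaining the minimum of
the R.H.S. of (11) ($\forall \mathbf{Q} \in \boldsymbol{\mathcal{Q}}, \forall C \in \mathcal{C}$) and $\Omega_c^*(\mathbf{Q}) = \arg\min_{C} \mathbb{Q}(\mathbf{Q}, C)$. ■
Proof: Please refer to Appendix A. ■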
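To make the fixed-point structure of the equivalent Bellman equation concrete, the following sketch runs relative Q-factor iteration on a toy average-cost MDP. It is not the algorithm of this paper: the inner minimization over the conditional power allocation set is assumed to be already folded into the `cost` and `trans` tables, the toy "state" stands in for the GQSI and the toy "action" for the clustering pattern, and all names are hypothetical.

```python
def relative_q_iteration(cost, trans, ref_state=0, iters=500):
    """Iterate Q(s,a) = cost[s][a] + sum_s' trans[s][a][s'] * min_a' Q(s',a') - theta."""
    n_states, n_actions = len(cost), len(cost[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        value = [min(row) for row in q]  # best Q-factor in each state
        theta = value[ref_state]         # offset that keeps the iterates bounded
        q = [[cost[s][a]
              + sum(p * v for p, v in zip(trans[s][a], value))
              - theta
              for a in range(n_actions)]
             for s in range(n_states)]
    theta = min(q[ref_state])  # estimate of the optimal average cost
    policy = [min(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return theta, q, policy


# Toy chain: the next state is uniform regardless of the action taken.
toy_cost = [[0.0, 1.0], [1.0, 2.0]]
uniform = [0.5, 0.5]
toy_trans = [[uniform, uniform], [uniform, uniform]]
theta, q_factor, policy = relative_q_iteration(toy_cost, toy_trans)
print(theta, policy)
```

In this toy problem the cheaper action is optimal in both states and the chain spends half of its time in each state, so the average cost equals the mean of the two cheapest per-stage costs. The table `q` has one entry per (state, action) pair, which is exactly what becomes infeasible when the state is the full GQSI.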
Remark 2: The equivalent Bellman equation in (11) is defined on the GQSI $\mathbf{Q}$ with cardinality $I_Q$
only. Nevertheless, the optimal power allocation policy $\Omega_p^* = \cup_{\mathbf{Q}}\, \Omega_p^*(\mathbf{Q})$ obtained by solving (11) is still
adaptive to GCSI $\mathbf{H}$ and GQSI $\mathbf{Q}$, where $\Omega_p^*(\mathbf{Q})$ are the conditional power allocation action sets given
by Definition 2. We shall illustrate this with a simple example below. In other words, the derived policies
of the equivalent Bellman equation in (11) solve the CPOMDP in Problem 1. ■

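Example 1: Suppose there are two MSs with the per-user CSI state space $\mathcal{H} = \{H_1, H_2\}$ as a simple example.
As a result, the global CSI state space is $\boldsymbol{\mathcal{H}} = \{H_1, H_2\}^2$ with cardinality $I_H = 4$. Given GQSI $\mathbf{Q}$, the
optimal conditional power allocation action set $\Omega_p^*(\mathbf{Q}) = \{\mathbf{p}^*(\mathbf{H}, \mathbf{Q}) : \mathbf{H} \in \boldsymbol{\mathcal{H}}\}$ (by Definition 2) for
any given pattern C is obtained by solving the R.H.S. of (11):

$$
\Omega_p^*(\mathbf{Q}) = \arg\min_{\{\mathbf{p}(\mathbf{H}^{(i)}, \mathbf{Q})\}_{i=1}^{4}} \sum_{i=1}^{4} \Pr[\mathbf{H}^{(i)}] \Big( g(\gamma, (\mathbf{H}^{(i)}, \mathbf{Q}), C, \Omega_p(\mathbf{H}^{(i)}, \mathbf{Q}))
+ \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' \,|\, \mathbf{H}^{(i)}, \mathbf{Q}, C, \Omega_p(\mathbf{H}^{(i)}, \mathbf{Q})] \min_{C'} \mathbb{Q}(\mathbf{Q}', C') \Big)
$$

Observe that the R.H.S. of the above equation is a decoupled objective function w.r.t. the variables
$\{\mathbf{p}(\mathbf{H}^{(i)}, \mathbf{Q})\}$ and hence

$$
\mathbf{p}^*(\mathbf{H}^{(i)}, \mathbf{Q}) = \arg\min_{\mathbf{p}(\mathbf{H}^{(i)}, \mathbf{Q})} \Big( g(\gamma, (\mathbf{H}^{(i)}, \mathbf{Q}), C, \Omega_p(\mathbf{H}^{(i)}, \mathbf{Q}))
+ \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' \,|\, \mathbf{H}^{(i)}, \mathbf{Q}, C, \Omega_p(\mathbf{H}^{(i)}, \mathbf{Q})] \min_{C'} \mathbb{Q}(\mathbf{Q}', C') \Big)
$$

Hence, using Lemma 1, the optimal power control policy is given by $\Omega_p^* = \cup_{\mathbf{Q}}\, \Omega_p^*(\mathbf{Q})$, which is a
function of both the GQSI and the GCSI. The optimal clustering pattern selection is given by $\Omega_c^*(\mathbf{Q}) =
\arg\min_{C} \mathbb{Q}(\mathbf{Q}, C)$, which is a function of the GQSI only. ■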

IV. General Low Complexity Decentralized Solution 

The key steps in obtaining the optimal control policies from the R.H.S. of the Bellman equation in
(11) rely on the knowledge of the pattern selection Q-factor $\{\mathbb{Q}(\mathbf{Q}, C)\}$ (which involves solving a system
of $I_C I_Q$ non-linear Bellman equations in (11) for given LMs with $I_C I_Q + 1$ unknowns $(\theta, \{\mathbb{Q}(\mathbf{Q}, C)\})$)
and the B LMs $\{\gamma_b : b \in \mathcal{B}\}$, which leads to enormous complexity. A brute-force solution has exponential
complexity and requires centralized implementation and knowledge of GCSI and GQSI (which leads to
huge signaling overheads). In this section, we shall approximate the pattern selection Q-factor by the sum
of Per-cluster Potential functions. Based on the approximation structure, we propose a novel distributive
online learning algorithm to estimate the Per-cluster Potential functions (performed at each CM) as well
as the LMs $\{\gamma_b : b \in \mathcal{B}\}$ (performed at each BS).

A. Linear Approximation of the Pattern Selection Q-Factor 

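Let $\boldsymbol{\mathcal{Q}}_n$ denote the CQSI state space of cluster n with cardinality $I_{Q_n} = (N_Q + 1)^{|\omega_n| K}$. To reduce
the cardinality of the state space and to decentralize the resource allocation, we approximate $\mathbb{Q}(\mathbf{Q}, C)$
by the sum of per-cluster potentials $V_n(\mathbf{Q}_n)$ ($\forall \omega_n \in C$), i.e.

$$\mathbb{Q}(\mathbf{Q}, C) \approx \sum_{\omega_n \in C} V_n(\mathbf{Q}_n) \qquad (12)$$

where $\mathbf{Q}_n \in \boldsymbol{\mathcal{Q}}_n$ is the CQSI state of cluster n ($\omega_n \in C$) and $\{V_n(\mathbf{Q}_n) : \forall \mathbf{Q}_n\}$ are per-cluster potential
functions which satisfy the following per-cluster potential fixed point equation: ($\forall \mathbf{Q}_n \in \boldsymbol{\mathcal{Q}}_n$)

$$\theta_n + V_n(\mathbf{Q}_n) = \min_{\Omega_{p_n}(\mathbf{Q}_n)} \Big\{ g_n(\gamma_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n)) + \sum_{\mathbf{Q}_n'} \Pr[\mathbf{Q}_n' \,|\, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n)]\, V_n(\mathbf{Q}_n') \Big\} \qquad (13)$$

where

$$g_n(\gamma_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n)) = \mathbb{E}\Big[ \underbrace{\sum_{b \in \omega_n} \Big( \sum_{k \in \mathcal{K}} \beta_{b,k} f(Q_{b,k}) + \gamma_b \big(P_b(\mathbf{H}_n, \mathbf{Q}_n) - \overline{P}_b\big) \Big)}_{g_n(\gamma_n, \mathbf{H}_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{H}_n, \mathbf{Q}_n))} \,\Big|\, \mathbf{Q}_n \Big] \qquad (14)$$

$$\Pr[\mathbf{Q}_n' \,|\, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n)] = \mathbb{E}\big[ \Pr[\mathbf{Q}_n' \,|\, \mathbf{H}_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{H}_n, \mathbf{Q}_n)] \,\big|\, \mathbf{Q}_n \big], \qquad (15)$$

$\gamma_n = \{\gamma_b : b \in \omega_n\}$, $P_b(\mathbf{H}_n, \mathbf{Q}_n) = \sum_{b' \in \omega_n} \sum_{k} \|\mathbf{w}_{(b',k),b}\|^2\, p_{b',k}(\mathbf{H}_n, \mathbf{Q}_n)$ is given by (2), and $\mathbf{p}_n = \{p_{b,k} :
b \in \omega_n, k \in \mathcal{K}\}$.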
In the literature, there are mainly three types of compact representations, which can be used to
approximate the potential functions [?], [12]: Artificial Neural Networks, Feature Extraction, and Parametric
Form. The first two approaches still need (GCSI, GQSI), have exponential complexity with respect to B
and K, and do not facilitate distributed implementations. Therefore, we adopt Parametric Form with linear
approximation. Due to the cluster-based structure and the relationship between the GQSI and the CQSI,
we can extract meaningful features and use the summation form for approximation, which naturally leads
to distributed implementation. Using the above linear approximation of the pattern selection Q-factor by
the sum of per-cluster potential functions in (12), the BSC determines the optimal clustering pattern
based on the current observed GQSI $\mathbf{Q}$ according to

$$\Omega_c^*(\mathbf{Q}) = \arg\min_{C \in \mathcal{C}} \sum_{\omega_n \in C} V_n(\mathbf{Q}_n) \qquad (16)$$

Based on the CQSI and CCSI observation $(\mathbf{H}_n, \mathbf{Q}_n)$, each CM n ($\omega_n \in \Omega_c^*(\mathbf{Q})$) determines $\Omega_{p_n}^*(\mathbf{Q}_n) =
\{\Omega_{p_n}^*(\mathbf{H}_n, \mathbf{Q}_n) : \forall \mathbf{H}_n\}$, which attains the minimum of the R.H.S. of (13) ($\forall \mathbf{Q} \in \boldsymbol{\mathcal{Q}}, \forall C \in \mathcal{C}$).
Hence, the overall power allocation control policy is given by $\Omega_p^* = \{\Omega_{p_n}^*(\mathbf{H}_n, \mathbf{Q}_n) : \omega_n \in \Omega_c^*(\mathbf{Q})\}$.
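The pattern selection rule at the BSC reduces to comparing, for each candidate pattern, the sum of the potentials reported by its clusters. The sketch below shows this selection step only. The data layout, the function names and the toy potential (total backlog divided by cluster size) are hypothetical stand-ins; in the proposed scheme the potentials are learned online.

```python
def select_pattern(patterns, potentials, queue_state):
    """Return the pattern whose per-cluster potentials sum to the smallest value.

    patterns: list of patterns, each a list of clusters (tuples of BS ids).
    potentials: dict mapping a cluster to a function of that cluster's queues.
    queue_state: dict mapping a BS id to a tuple of its per-user queue lengths.
    """
    def pattern_cost(pattern):
        return sum(
            potentials[cluster](tuple(queue_state[b] for b in cluster))
            for cluster in pattern
        )
    return min(patterns, key=pattern_cost)


def toy_potential(cluster_queues):
    """Hypothetical potential: total backlog shared among the cooperating BSs."""
    total = sum(sum(per_bs) for per_bs in cluster_queues)
    return total / len(cluster_queues)


patterns = [[(1, 2), (3,)], [(1,), (2, 3)]]
potentials = {cluster: toy_potential for pattern in patterns for cluster in pattern}

print(select_pattern(patterns, potentials, {1: (4,), 2: (4,), 3: (0,)}))
print(select_pattern(patterns, potentials, {1: (0,), 2: (4,), 3: (4,)}))
```

With the backlog concentrated at BSs 1 and 2 the first pattern is chosen, and when it shifts to BSs 2 and 3 the second one is, which mirrors the QSI-driven switching that motivates the design. The BSC only needs the potential values, not the CSI.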

B. Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation 

Since the derived policy $\Omega^* = (\Omega_c^*, \Omega_p^*)$ is a function of the per-cluster potential functions $\{V_n(\mathbf{Q}_n)\}$ ($\forall \omega_n$),
we need to obtain $\{V_n(\mathbf{Q}_n)\}$ by solving (13) and determine the LMs such that the per-BS average power
constraints in (8) are satisfied, which is not trivial. In this section, we shall apply stochastic learning
[13], [14] to estimate the per-cluster potential functions $\{V_n(\mathbf{Q}_n)\}$ ($\forall \omega_n$) distributively at each CM
based on realtime observations of the CCSI and CQSI, and the LMs at each BS based on the realtime power
allocation actions. The convergence proof of the online learning algorithm will be established in the
next section.

Fig. 3 illustrates the top level structure of the online learning algorithms. The Online Primal-Dual
Distributive Learning Algorithm via Stochastic Approximation, which requires knowledge of CCSI and
CQSI only, can be described as follows:

Algorithm 1: (Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation)

• Step 1 [Initialization]: Set $t = 0$. Each cluster $n$ initializes the per-cluster potential functions
$\{V_n^0(\mathbf{Q}_n)\}$ ($\forall \omega_n$). Each BS $b$ initializes the LM $\gamma_b^0$ ($\forall b \in \mathcal{B}$).

• Step 2 [Clustering Pattern Selection]: At the beginning of the $t$-th slot, the BSC determines the
clustering pattern $C(t)$ based on the GQSI $\mathbf{Q}(t)$ obtained from each BS according to (16), and broadcasts
$C(t)$ to the CMs of the active clusters $\omega_n \in C(t)$.

• Step 3 [Per-cluster Power Allocation]: Each CM $n$ ($\forall \omega_n \in C(t)$) of an active cluster obtains the
CCSI $\mathbf{H}_n(t)$, the CQSI $\mathbf{Q}_n(t)$ and the LMs $\gamma_b^t$ ($b \in \omega_n$) from the BSs in its cluster, based on which
it performs power allocation $\mathbf{p}_n(t) = \{p_{b,k}(t) : b \in \omega_n, k \in \mathcal{K}\}$ according to
$\Omega_{p_n}^*(\mathbf{H}_n(t), \mathbf{Q}_n(t))$.

• Step 4 [Potential Functions Update]: Each CM updates the per-cluster potential $V_n^{t+1}(\mathbf{Q}_n(t))$
based on the CQSI $\mathbf{Q}_n(t)$ according to (17) and reports the updated potential functions to the BSC.

• Step 5 [LMs Update]: Each BS $b$ ($b \in \mathcal{B}$) calculates the total power $P_b(t)$ based on $\{p_{b,k}(t)\}$ from
its CM according to (2) and updates its LM $\gamma_b^{t+1}$ according to (18).

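The per-cluster potential update in Step 4 and the LM update in Step 5, based on the CCSI observation
$\mathbf{H}_n(t)$ and the CQSI observation $\mathbf{Q}_n(t)$ at the current time slot $t$, are given by:

$V_n^{t+1}(\mathbf{Q}_n^i) = V_n^t(\mathbf{Q}_n^i) + \epsilon^v_{l_n^i(t)} Y_n^t(\mathbf{Q}_n^i) \cdot \mathbf{1}[\mathbf{Q}_n(t) = \mathbf{Q}_n^i], \quad \forall i \in \{1, \cdots, I_{Q_n}\}$ (17)

$\gamma_b^{t+1} = \Gamma[\gamma_b^t + \epsilon_t^\gamma (P_b(t) - P_b)]$ (18)

where $Y_n^t(\mathbf{Q}_n^i) = \big(g_n(\gamma_n^t, \mathbf{H}_n(t), \mathbf{Q}_n(t), \mathbf{p}_n(t)) + V_n^t(\mathbf{Q}_n(t+1))\big) - \big(g_n(\gamma_n^t, \mathbf{H}_n(\bar{t}_n), \mathbf{Q}_n^I, \mathbf{p}_n(\bar{t}_n)) +
V_n^t(\mathbf{Q}_n(\bar{t}_n + 1)) - V_n^t(\mathbf{Q}_n^I)\big) - V_n^t(\mathbf{Q}_n^i)$, $l_n^i(t) = \sum_{t'=0}^{t} \mathbf{1}[\mathbf{Q}_n(t') = \mathbf{Q}_n^i, \omega_n \in C(t')]$ is the number of
updates of $V_n(\mathbf{Q}_n^i)$ till $t$, $\mathbf{p}_n(t) = \{p_{b,k}(t) : b \in \omega_n, k \in \mathcal{K}\} = \Omega_{p_n}^*(\mathbf{H}_n(t), \mathbf{Q}_n(t))$, $\mathbf{Q}_n^I$ is the reference
state, $\bar{t}_n = \sup\{t : \mathbf{Q}_n(t) = \mathbf{Q}_n^I, \omega_n \in C(t)\}$ is the last time slot that the reference state $V_n(\mathbf{Q}_n^I)$ was
updated, and $\Gamma(\cdot)$ is the projection onto an interval $[0, B]$ for some $B > 0$. $\{\epsilon_t^v\}$ and $\{\epsilon_t^\gamma\}$ are the step size
sequences satisfying the following conditions:

$\sum_t \epsilon_t^v = \infty, \ \epsilon_t^v > 0, \ \epsilon_t^v \to 0, \ \sum_t \epsilon_t^\gamma = \infty, \ \epsilon_t^\gamma > 0, \ \epsilon_t^\gamma \to 0, \ \sum_t \big((\epsilon_t^v)^2 + (\epsilon_t^\gamma)^2\big) < \infty, \ \frac{\epsilon_t^\gamma}{\epsilon_t^v} \to 0$ (19)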
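As an illustration of the two-timescale updates (17) and (18), the sketch below uses one possible pair of step size sequences that satisfies (19). The exponents, the projection bound `upper`, and the function names are assumptions chosen for the example; the paper does not prescribe them.

```python
def eps_v(l: int) -> float:
    """Potential step size: sum diverges, sum of squares converges (exponent 0.6 is an assumption)."""
    return 1.0 / (l + 1) ** 0.6


def eps_gamma(t: int) -> float:
    """LM step size: decays faster than eps_v, so eps_gamma / eps_v -> 0 as required by (19)."""
    return 1.0 / (t + 1)


def potential_update(V: dict, visits: dict, state, y: float) -> None:
    """Asynchronous update in the spirit of (17): only the visited state changes,
    and its step size is indexed by how often that state has been updated."""
    V[state] += eps_v(visits[state]) * y
    visits[state] += 1


def lm_update(gamma: float, t: int, power_used: float, power_budget: float,
              upper: float = 10.0) -> float:
    """Projected LM update in the spirit of (18); [0, upper] plays the role of the interval [0, B]."""
    return min(max(gamma + eps_gamma(t) * (power_used - power_budget), 0.0), upper)


V = {0: 0.0}
visits = {0: 0}
potential_update(V, visits, 0, 2.0)
```

The LM grows when a BS exceeds its average power budget and is clipped at zero otherwise, while the potentials move on the faster timescale and therefore see the LMs as nearly constant.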

C. Convergence Analysis for Distributive Primal-Dual Online Learning 

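In this section, we shall establish the technical conditions for the almost-sure convergence of the
online distributive learning algorithm in Algorithm 1. Let $\mathbf{V}_n$ denote the $I_{Q_n}$-dimensional vector form
of $\{V_n(\mathbf{Q}_n)\}$. For any per-cluster LM vector $\gamma_n$, define a vector mapping $\mathbf{T}_n : \mathbb{R}^{|\omega_n|} \times \mathbb{R}^{I_{Q_n}} \to \mathbb{R}^{I_{Q_n}}$
of cluster $n$ with the $i$-th ($1 \le i \le I_{Q_n}$) component mapping as:

$T_{n,i}(\gamma_n, \mathbf{V}_n) \triangleq \min_{\Omega_{p_n}(\mathbf{Q}_n^i)} \Big\{ g_n(\gamma_n, \mathbf{Q}_n^i, \Omega_{p_n}(\mathbf{Q}_n^i)) + \sum_{\mathbf{Q}_n^j} \Pr[\mathbf{Q}_n^j | \mathbf{Q}_n^i, \Omega_{p_n}(\mathbf{Q}_n^i)] V_n(\mathbf{Q}_n^j) \Big\}$ (20)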
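Footnote: Without loss of generality, we set the reference state $\mathbf{Q}_n^I = \mathbf{0}$ ($\forall \omega_n$), i.e. buffer empty for all MSs in cluster $n$, and initialize
$V_n^0(\mathbf{Q}_n^I) = 0$.

Define $\mathbf{A}_{t-1}^n = \epsilon_{t-1}^v \mathbf{P}_t^n + (1 - \epsilon_{t-1}^v)\mathbf{I}$ and $\mathbf{B}_{t-1}^n = \epsilon_{t-1}^v \mathbf{P}_{t-1}^n + (1 - \epsilon_{t-1}^v)\mathbf{I}$, where $\mathbf{P}_t^n$ is the $I_{Q_n} \times I_{Q_n}$ average
transition probability matrix for the queue of cluster $n$ with $\Pr[\mathbf{Q}_n^j | \mathbf{Q}_n^i, \Omega_{p_n}^t(\mathbf{Q}_n^i)] = \mathbb{E}[\Pr[\mathbf{Q}_n^j | \mathbf{H}_n, \mathbf{Q}_n^i,
\Omega_{p_n}^t(\mathbf{H}_n, \mathbf{Q}_n^i)] | \mathbf{Q}_n^i]$ as its $(i, j)$-element and $\mathbf{I}$ is the $I_{Q_n} \times I_{Q_n}$ identity matrix.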

Since we have two different step size sequences $\{\epsilon_t^v\}$ and $\{\epsilon_t^\gamma\}$ with $\epsilon_t^\gamma = o(\epsilon_t^v)$, the per-cluster
potential updates and the LM updates are done simultaneously but over two different timescales. During
the per-cluster potential update (timescale I), we have $\gamma_b^{t+1} - \gamma_b^t = e(t)$ ($\forall b$), where $e(t) = O(\epsilon_t^\gamma) = o(\epsilon_t^v)$.
Therefore, the LMs appear to be quasi-static [15] during the per-cluster potential update in (17). We first
have the following lemma.

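Lemma 2: (Convergence of Per-cluster Potential Learning (Time Scale I)) Assume that for every set
of feasible control policies $\Omega_1, \cdots, \Omega_{m+1}$ in the policy space, there exist a $\delta_m = O(\epsilon^v) > 0$ and some
positive integer $m$ such that

$[\mathbf{A}_m^n \cdots \mathbf{A}_1^n]_{iI} \ge \delta_m, \quad [\mathbf{B}_m^n \cdots \mathbf{B}_1^n]_{iI} \ge \delta_m, \quad 1 \le i \le I_{Q_n}, \ \forall \omega_n$ (21)

where $[\cdot]_{iI}$ denotes the $(i, I)$-element of the corresponding $I_{Q_n} \times I_{Q_n}$ matrix. For a stepsize sequence $\{\epsilon_t^v\}$
satisfying the conditions in (19), we have $\lim_{t \to \infty} \mathbf{V}_n^t = \mathbf{V}_n^\infty(\gamma_n)$ a.s. ($\forall \omega_n$) for any initial potential
vector $\mathbf{V}_n^0$ and per-cluster LM vector $\gamma_n$, where the steady-state per-cluster potential $\mathbf{V}_n^\infty(\gamma_n)$
satisfies:

$\big(T_{n,I}(\gamma_n, \mathbf{V}_n^\infty(\gamma_n)) - V_n^\infty(\mathbf{Q}_n^I)\big)\mathbf{e} + \mathbf{V}_n^\infty(\gamma_n) = \mathbf{T}_n(\gamma_n, \mathbf{V}_n^\infty(\gamma_n)), \quad \forall \omega_n$ (22)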



Proof: Please refer to Appendix B for the proof. ■
Remark 3 (Interpretation of the Conditions in Lemma 2): Note that $\mathbf{A}_t^n$ and $\mathbf{B}_t^n$ are related to an
equivalent transition matrix of the underlying Markov chain. The condition in (21) simply means that the
reference state $\mathbf{Q}_n^I$ is accessible from all the states after some finite number of transition steps. This is a
very mild condition and will be satisfied in most of the cases of interest. ■
On the other hand, during the LM update (timescale II), we have $\lim_{t \to \infty} \|\mathbf{V}_n^t - \mathbf{V}_n^\infty(\gamma_n^t)\| = 0$ w.p.1
by Corollary 2.1 of [16]. Hence, during the LM update in (35), the per-cluster potential is seen as
almost equilibrated. The convergence of the LM is summarized below.

Lemma 3: (Convergence of LM over Timescale II) Under the same conditions as in Lemma 2, we have
$\lim_{t \to \infty} \gamma^t = \gamma^\infty$ a.s., where $\gamma^\infty$ satisfies the power constraints of all the BSs in (?). ■
Proof: Please refer to Appendix C for the proof. ■
Based on the above lemmas, we can summarize the convergence performance of the online per-cluster
potential and LM learning algorithm in the following theorem:

Theorem 1: (Convergence of Online Learning Algorithm) Under the same conditions as in Lemma 2, we
have $(\mathbf{V}_n^t, \gamma^t) \to (\mathbf{V}_n^\infty, \gamma^\infty)$ a.s. ($\forall \omega_n, b \in \omega_n, k \in \mathcal{K}$), where $\mathbf{V}_n^\infty$ and $\gamma^\infty$ satisfy

$\big(T_{n,I}(\gamma_n^\infty, \mathbf{V}_n^\infty) - V_n^\infty(\mathbf{Q}_n^I)\big)\mathbf{e} + \mathbf{V}_n^\infty = \mathbf{T}_n(\gamma_n^\infty, \mathbf{V}_n^\infty), \quad \forall \omega_n$ (23)

and the power constraints of all the BSs in (?). ■

V. Application to Network MIMO Systems with Poisson Arrival 

In this section, we shall illustrate the online primal-dual distributive learning algorithm for network 
MIMO systems under Poisson packet arrival and exponential packet size distribution. 

A. Dynamics of System State under Poisson Packet Arrival and Exponentially Distributed Packet Size

Under the Poisson assumption, we can consider packet flow rather than bit flow. Specifically, let $\mathbf{A}(t) =
\{A_{b,k}(t) : b \in \mathcal{B}, k \in \mathcal{K}\}$ and $\mathbf{N}(t) = \{N_{b,k}(t) : b \in \mathcal{B}, k \in \mathcal{K}\}$ be the random new packet arrivals and
the corresponding packet sizes for the $BK$ users in the multicell network at the end of the $t$-th scheduling
slot, respectively. $\mathbf{Q}(t)$ and $N_Q$ denote the GQSI matrix (number of packets) and the maximum buffer size
(number of packets), respectively.

Assumption 3 (Poisson Source Model): The packet arrival process $A_{b,k}(t)$ is i.i.d. over scheduling slots
following a Poisson distribution with average arrival rate $\mathbb{E}[A_{b,k}] = \lambda_{b,k}$, and independent w.r.t. $\{(b,k)\}$.
The random packet size $N_{b,k}(t)$ is i.i.d. over scheduling slots following an exponential distribution with
mean packet size $\bar{N}_{b,k}$, and independent w.r.t. $\{(b,k)\}$. ■
Given a stationary policy, define the conditional mean departure rate of packets of MS $(b, k)$ at the
$t$-th slot (conditioned on $\chi(t)$) as $\mu_{b,k}(\chi(t)) = \mathbb{E}[R_{b,k}(\chi(t)) / N_{b,k} | \chi(t)] = R_{b,k}(\chi(t)) / \bar{N}_{b,k}$.
Assumption 4 (Time Scale Separation): The slot duration $\tau$ is sufficiently small compared with the
average packet interarrival time as well as the conditional average packet service time, i.e. $\lambda_{b,k}\tau \ll 1$ and
$\mu_{b,k}(\chi(t))\tau \ll 1$. ■
By the memoryless property of the exponential distribution, the remaining packet length (also denoted
as $\mathbf{N}(t)$) at any slot $t$ is also exponentially distributed. Given a stationary control policy $\Omega$, the conditional
probability (conditioned on $\chi(t)$) of a packet departure event at the $t$-th slot is given by

$\Pr\Big[\frac{N_{b,k}(t)}{R_{b,k}(t)} < \tau \,\Big|\, \chi(t), \Omega(\chi(t))\Big] = \Pr\Big[\frac{N_{b,k}(t)}{\bar{N}_{b,k}} < \mu_{b,k}(\chi(t))\tau\Big] = 1 - \exp(-\mu_{b,k}(\chi(t))\tau) \approx \mu_{b,k}(\chi(t))\tau$

Footnote: This assumption is reasonable in practical systems, such as WiMAX. In practical systems, an application-level packet may
have a mean packet length spanning many time slots (frames), and this assumption is also adopted in the literature, such
as [?], [?].

where the last step is due to Assumption 4. Note that under Assumption 4, the probabilities of
simultaneous arrival, departure of two or more packets from the same queue or different queues, and
simultaneous arrival as well as departure in a slot are $O(\lambda_{b,k}\tau \cdot \lambda_{b',k'}\tau)$, $O(\mu_{b,k}(\chi(t))\tau \cdot \mu_{b',k'}(\chi(t))\tau)$
and $O(\lambda_{b,k}\tau \cdot \mu_{b,k}(\chi(t))\tau)$ respectively, which are asymptotically negligible. Therefore, the transition
kernel of the QSI evolution in this example can be simplified as:

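$\Pr[\mathbf{Q}_n' | \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n)] = \begin{cases} \lambda_{b,k}\tau & \text{if } \mathbf{Q}_n' = \mathbf{Q}_n + \mathbf{e}_{b,k} \\ \mathbb{E}[\mu_{b,k}(\mathbf{H}_n, \mathbf{Q}_n) | \mathbf{Q}_n]\tau & \text{if } \mathbf{Q}_n' = \mathbf{Q}_n - \mathbf{e}_{b,k} \\ 1 - \sum_{b \in \omega_n} \sum_{k \in \mathcal{K}} \big(\mathbb{E}[\mu_{b,k}(\mathbf{H}_n, \mathbf{Q}_n) | \mathbf{Q}_n] + \lambda_{b,k}\big)\tau & \text{if } \mathbf{Q}_n' = \mathbf{Q}_n \end{cases}$ (24)

where $\mu_{b,k}(\mathbf{H}_n, \mathbf{Q}_n) = R_{b,k}(\mathbf{H}_n, \mathbf{Q}_n)\tau / \bar{N}_{b,k}$ and $\mathbf{e}_{b,k}$ denotes the $|\omega_n| \times K$ matrix with element 1
corresponding to MS $(b, k)$ and all other elements 0.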
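To see why the linearized departure probability is reasonable under Assumption 4, the following small check compares $1 - \exp(-\mu\tau)$ with $\mu\tau$. The slot duration matches the 5 ms used later in the simulations; the departure rate `mu` is a hypothetical value picked only so that $\mu\tau \ll 1$.

```python
import math


def departure_probability(mu: float, tau: float) -> float:
    """Exact probability that an exponentially distributed packet finishes within one slot."""
    return 1.0 - math.exp(-mu * tau)


mu = 2.0      # hypothetical mean departure rate (packets per second)
tau = 0.005   # slot duration of 5 ms
exact = departure_probability(mu, tau)
approx = mu * tau   # first-order approximation used in the transition kernel
```

The gap between the two values is of second order in $\mu\tau$, which is the same order as the simultaneous-event probabilities that the kernel neglects.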



B. Decomposition of the Per-cluster Potential Function 

Observe that the cardinality of the per-cluster system state $I_{Q_n} = (N_Q + 1)^{|\omega_n|K}$ is still exponential
in the number of all the MSs in cluster $n$, i.e. $|\omega_n|K$. In the following lemma, we shall show that the
per-cluster potential can be further decomposed into per-cluster per-user potentials, which leads to a linear
order of growth in the cardinality of the state space, i.e. $|\omega_n|K(N_Q + 1)$.

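Lemma 4 (Decomposition of Per-cluster Potential): The per-cluster potential $V_n(\mathbf{Q}_n)$ ($\forall \omega_n$) defined
by the fixed point equation in (13) can be decomposed into the sum of the per-cluster per-user potential
functions $\{V_{n,(b,k)}(Q)\}$, i.e. $V_n(\mathbf{Q}_n) = \sum_{b \in \omega_n, k \in \mathcal{K}} V_{n,(b,k)}(Q_{b,k})$, where $\{V_{n,(b,k)}(Q)\}$ satisfy the
following per-cluster per-user potential fixed point equation ($\forall Q \in \mathcal{Q}$):

$\theta_{n,(b,k)} + V_{n,(b,k)}(Q) = \min_{\Omega_{p_{b,k}}(Q)} \Big\{ g_{n,(b,k)}(\gamma_n, Q, \Omega_{p_{b,k}}(Q)) + \sum_{Q'} \Pr[Q' | Q, \Omega_{p_{b,k}}(Q)] V_{n,(b,k)}(Q') \Big\}$ (25)

where

$g_{n,(b,k)}(\gamma_n, Q, \Omega_{p_{b,k}}(Q)) = \mathbb{E}\Big[ \underbrace{\beta_{b,k} f(Q) + p_{b,k}(\mathbf{H}_n, Q) \sum_{b' \in \omega_n} \gamma_{b'} \|\mathbf{w}_{(b,k),b'}\|^2}_{g_{n,(b,k)}(\gamma_n, \mathbf{H}_n, Q, p_{b,k}(\mathbf{H}_n, Q))} \,\Big|\, Q \Big]$ (26)

$\Pr[Q' | Q, \Omega_{p_{b,k}}(Q)] = \mathbb{E}\big[\Pr[Q' | \mathbf{H}_n, Q, \Omega_{p_{b,k}}(\mathbf{H}_n, Q)] \,\big|\, Q\big] = \begin{cases} \lambda_{b,k}\tau & \text{if } Q' = Q + 1 \\ \mathbb{E}[\mu_{b,k}(\mathbf{H}_n, Q) | Q]\tau & \text{if } Q' = Q - 1 \\ 1 - \big(\mathbb{E}[\mu_{b,k}(\mathbf{H}_n, Q) | Q] + \lambda_{b,k}\big)\tau & \text{if } Q' = Q \end{cases}$ (27)

where $\mu_{b,k}(\mathbf{H}_n, Q) = R_{b,k}(\mathbf{H}_n, Q)\tau / \bar{N}_{b,k}$. ■
Proof: Please refer to Appendix D for the proof. ■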
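The per-user birth-death kernel in (27) is small enough to build explicitly. The sketch below assembles it as a matrix for one MS; the function name and the NumPy representation are assumptions, and the boundary handling (no departure from an empty buffer, arrivals dropped at a full buffer) follows the $\min\{Q + 1, N_Q\}$ convention used in Appendix D.

```python
import numpy as np


def birth_death_kernel(lam: float, mu_bar, tau: float, n_q: int) -> np.ndarray:
    """Transition matrix P[q, q'] of one queue, following the structure of (27).

    lam    : average arrival rate lambda_{b,k}
    mu_bar : sequence of length n_q + 1 with the conditional mean departure rate per queue state
    tau    : slot duration
    n_q    : maximum buffer size N_Q
    """
    P = np.zeros((n_q + 1, n_q + 1))
    for q in range(n_q + 1):
        up = lam * tau if q < n_q else 0.0        # arrival is lost when the buffer is full
        down = mu_bar[q] * tau if q > 0 else 0.0  # nothing departs from an empty buffer
        if q < n_q:
            P[q, q + 1] = up
        if q > 0:
            P[q, q - 1] = down
        P[q, q] = 1.0 - up - down
    return P


P = birth_death_kernel(lam=0.5, mu_bar=[0.0, 1.0, 1.0, 1.0], tau=0.005, n_q=3)
```

Each row has at most three nonzero entries, which is what makes the per-user state space grow only linearly in $N_Q$.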

C. Per-Stage QSI-aware Interference Game for Power Allocation at the CMs 

In order to determine the power control action (Step 3 in Algorithm 1) in a distributive manner at each
CM, we shall formulate the power allocation of each cluster $n$ ($\omega_n \in C$) as a non-cooperative game [17].
The players are the CMs, and the payoff function for each cluster $n$ ($\omega_n \in C$) is defined as

$U_n(\mathbf{p}_n, \mathbf{p}_{-n}) = g_n(\gamma_n, \mathbf{H}_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{H}_n, \mathbf{Q}_n)) + \sum_{\mathbf{Q}_n'} \Pr[\mathbf{Q}_n' | \mathbf{H}_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{H}_n, \mathbf{Q}_n)] V_n(\mathbf{Q}_n')$

Each CM is a player in the game specified by:

$(\mathcal{G}): \ \min_{\mathbf{p}_n} U_n(\mathbf{p}_n, \mathbf{p}_{-n}), \quad \forall \omega_n \in C$ (28)

where $\mathbf{p}_n = \{p_{b,k} : b \in \omega_n, k \in \mathcal{K}\}$ is the power allocation of cluster $n$ ($\omega_n \in C$) and $\mathbf{p}_{-n} = \cup_{\omega_{n'} \in C, n' \ne n} \mathbf{p}_{n'}$
is the power allocation of all other clusters, indirectly observed through the interference measured by the MSs
in cluster $n$. It can be shown that the solution of the game $\mathcal{G}$ in (28), i.e. the Nash Equilibrium (NE), can be
characterized by the following fixed-point equation:

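$p_{b,k} = WF_{n,(b,k)}(\mathbf{p}_{-n}), \quad \forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C$ (29)

where the waterfilling operator $WF_{n,(b,k)}(\cdot)$ ($\forall b \in \omega_n, k \in \mathcal{K}$) is defined as:

$WF_{n,(b,k)}(\mathbf{p}_{-n}) = \Bigg( \frac{\frac{\tau}{\bar{N}_{b,k}} \Delta V_{n,(b,k)}(Q_{b,k})}{\sum_{b' \in \omega_n} \gamma_{b'} \|\mathbf{w}_{(b,k),b'}\|^2} - (1 + I_{n,(b,k)}) \Bigg)^+$ (30)

where $I_{n,(b,k)} = \sum_{\omega_{n'} \in C, n' \ne n} \sum_{b'' \in \omega_{n'}, k'' \in \mathcal{K}} \big(\sum_{b' \in \omega_{n'}} |\mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}|^2\big) p_{b'',k''}$ is the inter-cell interference
measured by the MS $(b, k)$ in cluster $n$ ($\forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C$). For notational convenience, let the $BK \times 1$
vectors $\mathbf{p}$ and $\mathbf{WF}$ denote the vector forms of $p_{b,k}$ and $WF_{n,(b,k)}$ ($\forall b \in \omega_n, k \in \mathcal{K}, \omega_n \in C$), respectively.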

We propose a QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) to achieve the
NE of the game $\mathcal{G}$ distributively. At each iteration, given the measurement of interference generated
by other clusters in the previous iteration, the overall power allocations are updated by the active CMs
simultaneously according to

$\mathbf{p}^{\nu+1} = \mathbf{WF}(\mathbf{p}^\nu)$ (31)

Remark 4 (Multi-level Water-filling Structure of QSIWFA): The waterfilling operator $WF_{n,(b,k)}(\cdot)$ in
(30) is a function of both the CCSI and the CQSI. It has the form of multi-level water-filling, where the power is
allocated according to the CCSI in terms of $\mathbf{w}_{(b,k),b'}$ but the water-level is adaptive to the CQSI (indirectly
via $\Delta V_{n,(b,k)}(Q_{b,k})$). ■

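Next, we shall discuss the existence and uniqueness of the NE of game $\mathcal{G}$ and the convergence of the multi-level
QSIWFA in (31). Define the $BK \times BK$ matrix $\mathbf{S}$ with its element $[\mathbf{S}]_{(b,k),(b',k')}$ given by:

$[\mathbf{S}]_{(b,k),(b',k')} = \begin{cases} \sum_{b'' \in \omega_{n'}} |\mathbf{h}_{(b,k),b''} \mathbf{w}_{(b',k'),b''}|^2 & \text{if } n' \ne n \\ 0 & \text{if } n' = n \end{cases} \quad \forall b \in \omega_n, b' \in \omega_{n'}, k, k' \in \mathcal{K}$ (32)

Given some $BK \times 1$ vector $\mathbf{u}$ with each component positive, let $\|\cdot\|_{\infty,vec}^{\mathbf{u}}$ and $\|\cdot\|_{\infty,mat}^{\mathbf{u}}$ denote
the vector weighted maximum norm and the matrix norm defined in [17], respectively. Then, we have
$\|\mathbf{WF}(\mathbf{p})\|_{\infty,vec}^{\mathbf{u}} = \max_{b,k} \frac{|WF_{n,(b,k)}(\mathbf{p})|}{u_{b,k}}$ and $\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} = \max_{b,k} \frac{1}{u_{b,k}} \sum_{b' \in \mathcal{B}, k' \in \mathcal{K}} [\mathbf{S}]_{(b,k),(b',k')} u_{b',k'}$, where
$\mathbf{S} \in \mathbb{R}^{BK \times BK}$, and $[\cdot]_{(b,k),(b',k')}$ denotes the element in the row corresponding to MS $(b, k)$ and the
column corresponding to MS $(b', k')$. The convergence of the QSIWFA in (31) is summarized in the
following lemma.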

Lemma 5: (Convergence of the QSIWFA) If $\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} < 1$ is satisfied for some $\mathbf{u} > 0$, then the
mapping $\mathbf{WF}$ is a contraction mapping with modulus $\alpha = \|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}}$ w.r.t. the norm $\|\cdot\|_{\infty,vec}^{\mathbf{u}}$. The NE
of game $\mathcal{G}$ exists and is unique. As $\nu \to \infty$, the QSIWFA in (31) converges to the unique NE of game
$\mathcal{G}$, which is the solution to the fixed point equation in (29). ■
Proof: Please refer to Appendix E for the proof. ■
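The simultaneous iteration (31) and the sufficient condition of Lemma 5 can be illustrated with a toy two-user example. In this sketch the QSI-dependent numerator and the LM-weighted denominator of (30) are collapsed into a single per-user `level` vector, the noise is normalized to 1, and `S` holds the cross-cluster interference gains; all names and numbers are assumptions for illustration only.

```python
import numpy as np


def water_fill(p: np.ndarray, level: np.ndarray, S: np.ndarray) -> np.ndarray:
    """One simultaneous water-filling step: p_i <- (level_i - (1 + sum_j S_ij p_j))^+."""
    return np.maximum(level - (1.0 + S @ p), 0.0)


def weighted_inf_mat_norm(S: np.ndarray, u: np.ndarray) -> float:
    """Weighted maximum matrix norm: max_i (1 / u_i) * sum_j |S_ij| u_j."""
    return float(np.max((np.abs(S) @ u) / u))


def qsiwfa(level: np.ndarray, S: np.ndarray, iters: int = 200, tol: float = 1e-12) -> np.ndarray:
    """Iterate p^{nu+1} = WF(p^nu) from zero power until the update is below tol."""
    p = np.zeros_like(level)
    for _ in range(iters):
        p_new = water_fill(p, level, S)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p


# Two users in different clusters; the diagonal is zero (no self-interference term in S).
S = np.array([[0.0, 0.2],
              [0.3, 0.0]])
level = np.array([3.0, 2.5])
alpha = weighted_inf_mat_norm(S, np.ones(2))   # contraction modulus for u = (1, 1)
p_star = qsiwfa(level, S)
```

Here `alpha` is below 1, so the iteration contracts and `p_star` is the unique fixed point of the toy water-filling map; with larger cross gains the same loop may oscillate or fail to settle.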



Remark 5 (Interpretation of Sufficient Condition for QSIWFA Convergence): The intuitive meaning of
the condition $\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} < 1$ is that the inter-cluster interference is sufficiently small compared with the
signal power from the cooperative BSs in the same cluster [17]. This happens with high probability because
the interference from the inter-cluster BSs is attenuated by the geometry of the cluster
topology. Fig. 4 also illustrates that the condition $\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} < 1$ can be satisfied with high probability.

D. Compact Queue State in Online Primal-Dual Distributive Learning Algorithm 

To further reduce the memory size as well as the frequency of clustering updates at the BSC, we shall
use the feature-based linear architecture [18] to approximate the original per-cluster per-user potential
functions. Specifically, we define the following compact queue state.

Definition 3 (Compact Queue State): Define the compact queue state as $Q = qd$ ($q = 0, \cdots, l_q$, with $l_q =
\lfloor N_Q / d \rfloor$), where $d$ ($d \le N_Q$, $d \in \mathbb{N}^+$) is the corresponding resolution level. The approximate potential
functions of the compact queue states $\{\tilde{V}_{n,(b,k)}(q) : q = 0, \cdots, l_q\}$ are defined as the compact per-cluster
per-user potential functions. ■
Therefore, the linear approximation of the original per-cluster per-user potential functions is given by

$V_{n,(b,k)}(qd + l) = \tilde{V}_{n,(b,k)}(q) + \frac{l}{d}\big(\tilde{V}_{n,(b,k)}(q + 1) - \tilde{V}_{n,(b,k)}(q)\big),$

$\forall l \in \{0, \cdots, d - 1\}, \ q \in \{0, \cdots, l_q - 1\}$ (33)

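Let $\mathbf{V}_{n,(b,k)} = (V_{n,(b,k)}(0), \cdots, V_{n,(b,k)}(N_Q))^T$ and $\tilde{\mathbf{V}}_{n,(b,k)} = (\tilde{V}_{n,(b,k)}(0), \cdots, \tilde{V}_{n,(b,k)}(l_q))^T$ be the
vector forms of the per-cluster per-user potential functions and the compact per-cluster per-user potential
functions of MS $(b, k)$ in cluster $\omega_n$, respectively. Accordingly, their relationship is given by $\mathbf{V}_{n,(b,k)} =
\mathbf{M}\tilde{\mathbf{V}}_{n,(b,k)}$ and $\tilde{\mathbf{V}}_{n,(b,k)} = \mathbf{M}^\dagger \mathbf{V}_{n,(b,k)}$, where $\mathbf{M}$ is the $(N_Q + 1) \times (l_q + 1)$ matrix with $(qd + l, q)$-element $\frac{d - l}{d}$,
$(qd + l, q + 1)$-element $\frac{l}{d}$, $(l_q d, l_q)$-element 1 ($\forall q \in \{0, \cdots, l_q - 1\}$, $\forall l \in \{0, \cdots, d - 1\}$) and all other elements
0, and $\mathbf{M}^\dagger$ is the $(l_q + 1) \times (N_Q + 1)$ matrix with $(q, qd)$-element ($\forall q \in \{0, \cdots, l_q\}$) 1 and all other elements 0.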
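The relationship between the original and compact potentials can be made concrete with a short sketch. It builds the interpolation matrix of (33) and the matching sampling matrix using zero-based indices, and it assumes that $N_Q$ is a multiple of $d$ (as with $N_Q = 9$, $d = 3$ in the simulations); the function names are hypothetical.

```python
import numpy as np


def interpolation_matrix(n_q: int, d: int) -> np.ndarray:
    """(n_q + 1) x (l_q + 1) matrix mapping compact potentials to full potentials via (33)."""
    if n_q % d != 0:
        raise ValueError("this sketch assumes n_q is a multiple of d")
    l_q = n_q // d
    M = np.zeros((n_q + 1, l_q + 1))
    for q in range(l_q):
        for l in range(d):
            M[q * d + l, q] = (d - l) / d      # weight on the lower compact state
            M[q * d + l, q + 1] = l / d        # weight on the upper compact state
    M[l_q * d, l_q] = 1.0                      # last compact state maps to itself
    return M


def sampling_matrix(n_q: int, d: int) -> np.ndarray:
    """(l_q + 1) x (n_q + 1) matrix that picks the full potentials at the states Q = q * d."""
    l_q = n_q // d
    M_dag = np.zeros((l_q + 1, n_q + 1))
    for q in range(l_q + 1):
        M_dag[q, q * d] = 1.0
    return M_dag


M = interpolation_matrix(9, 3)
M_dag = sampling_matrix(9, 3)
full = M @ np.array([0.0, 3.0, 6.0, 9.0])   # a linear compact potential is reproduced exactly
```

Storing only the $l_q + 1$ compact values per user and interpolating on demand is what reduces the memory at each CM by roughly a factor of $d$.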

Applying Algorithm 1 to estimate the compact per-cluster per-user potential functions and LMs with the
per-stage QSI-aware interference game in Section V-C for power allocation, we obtain the distributive
online learning algorithm for Poisson arrival and exponential packet size distribution as illustrated in
Fig. 3. Specifically, the compact per-cluster per-user potential update and LM update based on the CCSI
observation $\mathbf{H}_n(t)$ and the CQSI observation $\{Q_{b,k}(t) : \forall b \in \omega_n, k \in \mathcal{K}\}$ are given by:

$\tilde{V}_{n,(b,k)}^{t+1}(q) = \tilde{V}_{n,(b,k)}^t(q) + \epsilon^v_{l_{n,(b,k)}^q(t)} \tilde{Y}_{n,(b,k)}^t(q) \cdot \mathbf{1}[Q_{b,k}(t) = qd, \omega_n \in C(t)]$ (34)



K,(b,k)il) =\9n,{b,k){ln^^n{t),Qb,k{t),Pb,k{t)) + Pb,kT 

+ H^n,{b,k)\t)\ 



d J 

$\gamma_b^{t+1} = \Gamma[\gamma_b^t + \epsilon_t^\gamma (P_b(t) - P_b)]$ (35)

Footnote: Note that when $d = 1$, $\mathbf{M}$ and $\mathbf{M}^\dagger$ become the $(N_Q + 1) \times (N_Q + 1)$ identity matrix, and the feature state space is equivalent
to the original state space. In other words, $\tilde{\mathbf{V}}_{n,(b,k)} = \mathbf{V}_{n,(b,k)}$. $\tilde{\mathbf{V}}_{n,(b,k)}$ can be obtained by online learning via stochastic
approximation.

where $l_{n,(b,k)}^q(t) = \sum_{t'=0}^{t} \mathbf{1}[Q_{b,k}(t') = qd, \omega_n \in C(t')]$ is the number of updates of $\tilde{V}_{n,(b,k)}(q)$ till $t$,

Pbdt) = ( E':L„7^'l|w(l),..(t)|p - iL' ^n,(b,fc)(i) = {QbMt + 1) = + l.'^n G Cit)} is the 

arrival event, $q^I$ is the reference state, $\bar{t} = \sup\{t : Q_{b,k}(t) = q^I d, \omega_n \in C(t)\}$ is the last time slot
that the reference state $\tilde{V}_{n,(b,k)}(q^I)$ was updated, and $P_b(t) = \sum_{b' \in \omega_n} \sum_{k \in \mathcal{K}} \|\mathbf{w}_{(b',k),b}\|^2 p_{b',k}(t)$ is the total
transmit power of BS $b$ determined by the per-stage QSI-aware interference game.

VI. Simulation Results and Discussions 

In this section, we shall compare the proposed two-timescale delay-optimal dynamic clustering and
power allocation design with three baselines. Baseline 1 refers to the Fixed Channel Assignment (FCA)
without cooperation among BSs in standard cellular systems with frequency reuse factor (FRF) 7. Baseline
2 and Baseline 3 refer to static clustering and greedy dynamic clustering [?], respectively. For any given clustering
pattern, optimal power allocation is performed at each BS (Baseline 1) or CM (Baseline 2 and Baseline
3) based on the available instantaneous CSI to maximize the sum throughput of the cluster. In the simulation,
we consider a cellular system with 19 BSs, each with a coverage of 500 m. We apply the Urban Macrocell
Model in 3GPP with path loss model given by $PL = 34.5 + 35\log_{10}(r)$, where $r$ (in m) is the distance
from the transmitter to the receiver. Each element of the small scale fading channel matrix is $\mathcal{CN}(0, 1)$
distributed. The total BW is 10 MHz. We consider Poisson arrival with average arrival rate $\lambda_{b,k}$ (pck/slot).
The scheduling slot duration $\tau$ is 5 ms. The maximum buffer size $N_Q$ is 9 and the mean packet size is
$\bar{N}_{b,k} = 264$ Kbyte.

Fig. 5(a) illustrates the average delay per user versus transmit power with $N_t = 4$ and $N_t = 2$ and the
maximum cluster size $N_B = 3$. The average delay of all the schemes decreases as the transmit power or
the number of transmit antennas increases. Observe that the performance of Baseline 1 is inferior to that of
Baselines 2 and 3, which illustrates the gain of base station cooperation in network MIMO systems.
Furthermore, Baseline 3 outperforms Baseline 2, illustrating the benefit of dynamic clustering in network
MIMO. Finally, the proposed scheme achieves a significant performance gain over all baselines.
This gain is contributed by the QSI-aware dynamic clustering as well as the QSIWFA for power control.
Fig. 5(b) illustrates the average delay versus average transmit power under maximum cluster
sizes $N_B = 3$ and $N_B = 2$. Similar observations about the performance gain can be made. Table II
illustrates the complexity in terms of the CPU time of the baselines and the proposed scheme. It can be
seen that the proposed scheme achieves a significant performance gain with reasonable complexity
compared with the complexity of the baselines.

Footnote: Without loss of generality, we set the reference state $q^I = 0$ and initialize $\tilde{V}_{n,(b,k)}^0(q^I) = 0$, $\forall n, b, k$.

Fig. 6 illustrates the average delay versus per-user loading (average arrival rate $\lambda_{b,k}$) at a transmit power
of $P_b = 30$ dBm and number of MSs per BS $K = 1, 2$. The proposed scheme achieves a significant gain
over all the baselines across a wide range of input loading.

Fig. 7 illustrates the convergence property of the proposed online learning algorithm for estimating the
Compact Per-cluster Per-user Potential Functions $\{\tilde{V}_{n,(b,k)}\}$. We plot the transient of the potential function
versus slot index at a transmit power $P_b = 30$ dBm. It can be observed that the proposed distributive
learning algorithm converges quite fast. The average delay corresponding to the 500-th scheduling
slot is 2.4069 pck, which is quite close to the optimal delay and is much smaller than that of the baselines.

VII. Summary 

In this paper, we propose a two-timescale delay-optimal dynamic clustering and power allocation design
for downlink network MIMO systems. We show that the delay-optimal control can be formulated
as an infinite-horizon average cost CPOMDP and derive an equivalent Bellman equation to solve the
CPOMDP. To address the distributive requirement and the issue of exponential memory requirement
and computational complexity, we propose a novel distributive online learning algorithm to
estimate the distributive potential functions as well as the LMs. We also show that the proposed distributive
online learning algorithm converges almost surely (with probability 1). We formulate the instantaneous
power allocation as a Per-stage QSI-aware Interference Game played among all the CMs and propose a
QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) to achieve the NE. The proposed
algorithm achieves significant performance gains over all the baselines due to the QSI-aware dynamic
clustering and QSIWFA power control.

Appendix 



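Appendix A: Proof of Lemma 1

Given any stationary pattern selection policy $\Omega_c$, by standard MDP techniques, the optimal power
allocation policy $\Omega_p(\mathbf{H}, \mathbf{Q})$ can be obtained by solving the following Bellman equation:

$\theta_{\Omega_c} + V_{\Omega_c}(\mathbf{H}, \mathbf{Q}) = \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{H}', \mathbf{Q}'} \Pr[\mathbf{H}', \mathbf{Q}' | \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p] V_{\Omega_c}(\mathbf{H}', \mathbf{Q}') \Big\}$

$\overset{(a)}{=} \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p] \sum_{\mathbf{H}'} \Pr[\mathbf{H}'] V_{\Omega_c}(\mathbf{H}', \mathbf{Q}') \Big\}$

$\overset{(b)}{\Rightarrow} \mathbb{E}[\theta_{\Omega_c} + V_{\Omega_c}(\mathbf{H}, \mathbf{Q}) | \mathbf{Q}] = \mathbb{E}\Big[ \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p] \sum_{\mathbf{H}'} \Pr[\mathbf{H}'] V_{\Omega_c}(\mathbf{H}', \mathbf{Q}') \Big\} \Big| \mathbf{Q} \Big], \ \forall \mathbf{Q}$

where $\{V_{\Omega_c}(\mathbf{H}, \mathbf{Q})\}$ are the associated potential functions, (a) is due to (?) and (b) is obtained by taking
the conditional expectation (average over $\mathbf{H}$ conditioned on $\mathbf{Q}$) on both sides, due to the i.i.d. assumption on
the GCSI $\mathbf{H}$ in Assumption 1. The optimal $\Omega_c$ can be obtained by solving the following Bellman equation:

$\mathbb{E}[\theta + V(\mathbf{H}, \mathbf{Q}) | \mathbf{Q}] = \min_{\Omega_c(\mathbf{Q})} \mathbb{E}\Big[ \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p] \sum_{\mathbf{H}'} \Pr[\mathbf{H}'] V(\mathbf{H}', \mathbf{Q}') \Big\} \Big| \mathbf{Q} \Big], \ \forall \mathbf{Q}$

where $\{V(\mathbf{H}, \mathbf{Q})\}$ are the associated potential functions. Defining $V(\mathbf{Q}) = \mathbb{E}[V(\mathbf{H}, \mathbf{Q}) | \mathbf{Q}]$, we can obtain
the equivalent Bellman equation:

$\theta + V(\mathbf{Q}) = \min_{\Omega_c(\mathbf{Q})} \mathbb{E}\Big[ \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{H}, \mathbf{Q}, \Omega_c, \Omega_p] V(\mathbf{Q}') \Big\} \Big| \mathbf{Q} \Big]$

$\overset{(c)}{=} \min_{\Omega_c(\mathbf{Q}), \Omega_p(\mathbf{Q})} \Big\{ g(\gamma, \mathbf{Q}, \Omega_c, \Omega_p) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{Q}, \Omega_c, \Omega_p] V(\mathbf{Q}') \Big\}, \ \forall \mathbf{Q}$ (36)

where (c) is due to the definition of "conditional action sets" in Definition 2. Let $\Omega^* = (\Omega_c^*, \Omega_p^*)$ denote
the optimal control policy minimizing the R.H.S. of (36) at any state $\mathbf{Q}$, and $\theta = L_\beta^*(\gamma) = \inf_\Omega L_\beta(\Omega, \gamma)$
is the optimal average cost per stage. By Definition 2, we have the associated original control policy
$\Omega^*(\chi) = (\Omega_c^*(\mathbf{Q}), \Omega_p^*(\chi))$, which solves Problem 1, and hence $\theta$ is also the optimal average cost per
stage of Problem 1. Due to the discrete nature of pattern selection, we introduce the Pattern Selection
Q-factor $\mathbb{Q}(\mathbf{Q}, C)$ ($\forall C \in \mathcal{C}$, $\forall \mathbf{Q}$) to facilitate the pattern selection, which is defined as

$\mathbb{Q}(\mathbf{Q}, C) \triangleq \min_{\Omega_p(\mathbf{Q})} \Big\{ g(\gamma, \mathbf{Q}, C, \Omega_p(\mathbf{Q})) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | \mathbf{Q}, C, \Omega_p(\mathbf{Q})] V(\mathbf{Q}') \Big\} - \theta$ (37)

$\overset{(d)}{=} \mathbb{E}\Big[ \min_{\Omega_p(\mathbf{H},\mathbf{Q})} \Big\{ g(\gamma, (\mathbf{H}, \mathbf{Q}), C, \Omega_p(\mathbf{H}, \mathbf{Q})) + \sum_{\mathbf{Q}'} \Pr[\mathbf{Q}' | (\mathbf{H}, \mathbf{Q}), C, \Omega_p(\mathbf{H}, \mathbf{Q})] V(\mathbf{Q}') \Big\} \Big| \mathbf{Q} \Big] - \theta$

$=$ R.H.S. of (?)

where (d) is due to Definition 2. Therefore, $V(\mathbf{Q}) = \min_C \mathbb{Q}(\mathbf{Q}, C)$, $\Omega_c^*(\mathbf{Q}) = \arg\min_C \mathbb{Q}(\mathbf{Q}, C)$ and
$\{\mathbb{Q}(\mathbf{Q}, C)\}$ satisfies (?).

Appendix B: Proof of Lemma 2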

Since the per-cluster potential function $\{V_n^t(\mathbf{Q}_n)\}$ of each queue state $\mathbf{Q}_n$ is updated comparably
often [?], the only difference between the synchronous and asynchronous update is that the resultant
ODE of the asynchronous update is a time-scaled version of the synchronous update [?], which does not
affect the convergence behavior. Therefore, we consider the convergence of the related synchronous version
in the following. In the following proof, we shall use $i$ ($1 \le i \le I_{Q_n}$) instead of $\mathbf{Q}_n^i$ for simplicity. Since
$\gamma_n$ is quasi-static over timescale I, we shall omit $\gamma_n$ in the arguments of the function $\mathbf{T}_n(\cdot)$.

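We shall first show the convergence of the martingale noise in (17). Let $\mathbb{E}_t$ and $\Pr_t$ denote the
expectation and probability conditioned on the $\sigma$-algebra $\mathcal{F}_t$ generated by $\{\mathbf{V}_n^0, \mathbf{Y}_n^k, k < t\}$. Define
$S_n^t(i) \triangleq \mathbb{E}_t(Y_n^t(i)) = T_{n,i}(\mathbf{V}_n^t) - V_n^t(i) - \big(T_{n,I}(\mathbf{V}_n^t) - V_n^t(I)\big)$, where $Y_n^t(i)$ is the noise-corrupted
observation of $S_n^t(i)$ given in (17). Define $\delta M_n^t(i) = Y_n^t(i) - \mathbb{E}_t[Y_n^t(i)]$, which is the martingale difference
noise with the property that $\mathbb{E}_t[\delta M_n^t(i)] = 0$ and $\mathbb{E}[\delta M_n^t(i)\delta M_n^{t'}(i)] = 0$, $\forall t \ne t'$. For some $k$, define
$M_n^t(i) = \sum_{l=k}^{t} \epsilon_l^v \delta M_n^l(i)$. Thus, from (17), we have

$V_n^{t+1}(i) = V_n^t(i) + \epsilon_t^v [S_n^t(i) + \delta M_n^t(i)] = V_n^k(i) + \sum_{l=k}^{t} \epsilon_l^v S_n^l(i) + M_n^t(i)$ (38)

Since $\mathbb{E}_t[M_n^t(i)] = M_n^{t-1}(i)$, $\{M_n^t(i)\}$ is a martingale sequence. By Kolmogorov's inequality, we have
$\Pr_k\{\sup_{k \le l \le t} |M_n^l(i)| \ge \lambda\} \le \frac{\mathbb{E}_k[|M_n^t(i)|^2]}{\lambda^2}$. By the boundedness assumption of $\delta M_n^t(i)$ ($\forall t, i$)
and the property of the martingale difference noise as well as the condition on the stepsize sequence in
(19), we have $\mathbb{E}_k[|M_n^t(i)|^2] = \sum_{l=k}^{t} \mathbb{E}_k[(\epsilon_l^v)^2 (\delta M_n^l(i))^2] \le \bar{M} \sum_{l=k}^{t} (\epsilon_l^v)^2 \to 0$ as $k \to \infty$, and hence
$\lim_{k \to \infty} \Pr_k\{\sup_{k \le l \le t} |M_n^l(i)| \ge \lambda\} = 0$. Thus, as $k \to \infty$, (38) goes to $V_n^{t+1}(i) = V_n^k(i) + \sum_{l=k}^{t} \epsilon_l^v S_n^l(i)$
with probability 1. The vector form of the update is given by:

$\mathbf{V}_n^{t+1} = \mathbf{V}_n^k + \sum_{l=k}^{t} \epsilon_l^v \big[\mathbf{T}_n(\mathbf{V}_n^l) - \mathbf{V}_n^l - \big(T_{n,I}(\mathbf{V}_n^l) - V_n^l(I)\big)\mathbf{e}\big]$ (39)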

Next, we shall show the convergence of (39) after the martingale noise is averaged out. Let $\Omega_{p_n}^t$ denote
the optimal control action attaining the minimum in $\mathbf{T}_n(\mathbf{V}_n^t)$. Let $\mathbf{g}_n^t$ and $\mathbf{P}_n^t$ denote the conditional
per-stage reward vector and the conditional average transition probability matrix under $\Omega_{p_n}^t$. Denote $w^t =
T_{n,I}(\mathbf{V}_n^t) - V_n^t(I)$, we have

where $C_1$ is some constant. Since $S^t(I) = 0$, $\forall t$, by the assumption in (21), we have

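$(1 - \delta) \min_{i'} S^{t-m}(i') - C_1(w^t - w^{t-m}) \le S^t(i) \le (1 - \delta) \max_{i'} S^{t-m}(i') - C_1(w^t - w^{t-m}), \ \forall i$

$\Rightarrow \min_{i'} S^t(i') \ge (1 - \delta) \min_{i'} S^{t-m}(i') - C_1(w^t - w^{t-m})$

$\quad \max_{i'} S^t(i') \le (1 - \delta) \max_{i'} S^{t-m}(i') - C_1(w^t - w^{t-m})$

Therefore, $\max_{i'} S^t(i') - \min_{i'} S^t(i') \le (1 - \delta)\big(\max_{i'} S^{t-m}(i') - \min_{i'} S^{t-m}(i')\big)$, so that $\max_{i'} S^t(i') -
\min_{i'} S^t(i') \le \phi_k \prod_{l=0}^{\lfloor (t-k)/m \rfloor} (1 - \delta_{k+lm})$, where $\phi_k > 0$. Since $S^t(I) = 0$, we have $\max_{i'} S^t(i') \ge 0$ and
$\min_{i'} S^t(i') \le 0$. Thus, $\forall i$, we have $|S^t(i)| \le \max_{i'} S^t(i') - \min_{i'} S^t(i') \le \phi_k \prod_{l=0}^{\lfloor (t-k)/m \rfloor} (1 - \delta_{k+lm})$.

Therefore, as $t \to \infty$, $\mathbf{S}^t \to \mathbf{0}$, i.e. $\mathbf{V}_n^\infty$ satisfies the fixed point equation (22). $\mathbf{V}_n^\infty$ is the potential
vector, which is unique up to a constant vector [?]. However, due to the property that $S^t(I) = 0$ $\forall t$, which implies
$V_n^t(I) = V_n^0(I)$ $\forall t$, we have the convergence of the potential vector $\mathbf{V}_n^\infty = \lim_{t \to \infty} \mathbf{V}_n^t$.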

Appendix C: Proof of Lemma 3

Due to the two-timescale separation, the primal update of the per-cluster per-user potential can be
regarded as converged to $\mathbf{V}_n^\infty(\gamma^t)$ w.r.t. the LMs $\{\gamma_b^t\}$ at the $t$-th slot [16]. Using the standard stochastic approximation
theorem [15], the dynamics of the LM $\gamma_b$ update equation in (18) for BS $b$ ($\forall b \in \mathcal{B}$) can be represented
by the following ODE:

$\dot{\gamma}_b(t) = \mathbb{E}^{(\Omega_c^*(\gamma(t)), \Omega_p^*(\gamma(t)))}[P_b(t) - P_b], \ \forall b \in \mathcal{B}$ (40)

where $\Omega_c^*(\gamma(t))$ is the converged policy in (16), $\Omega_p^*(\gamma(t))$ is the converged policy minimizing the R.H.S.
of (?) for each cluster $n$, and $\mathbb{E}^{(\Omega_c^*(\gamma(t)), \Omega_p^*(\gamma(t)))}[\cdot]$ denotes the expectation w.r.t. the measure induced by
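$(\Omega_c^*(\gamma(t)), \Omega_p^*(\gamma(t)))$. Define $G(\gamma) = \mathbb{E}^{(\Omega_c^*(\gamma), \Omega_p^*(\gamma))}\big[\sum_{\omega_n \in C} g_n(\gamma_n, \mathbf{Q}_n, \Omega_{p_n}^*(\mathbf{Q}_n))\big]$, where $\Omega_p^*(\gamma) =
\arg\min_{\Omega_p(\gamma)} \mathbb{E}^{(\Omega_c^*(\gamma), \Omega_p(\gamma))}\big[\sum_{\omega_n \in C} g_n(\gamma_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n))\big]$. Since the clustering pattern selection policy is
discrete, we have $\Omega_c^*(\gamma) = \Omega_c^*(\gamma + \delta_\gamma)$. Hence, by the chain rule, we have $\frac{\partial G}{\partial \gamma_b} = \sum_{b',k} \frac{\partial G}{\partial p_{b',k}} \frac{\partial p_{b',k}}{\partial \gamma_b} +
\mathbb{E}^{(\Omega_c^*(\gamma), \Omega_p^*(\gamma))}[P_b(t) - P_b]$. Since $\Omega_p^*(\gamma) = \arg\min_{\Omega_p(\gamma)} \mathbb{E}^{(\Omega_c^*(\gamma), \Omega_p(\gamma))}\big[\sum_{\omega_n \in C} g_n(\gamma_n, \mathbf{Q}_n, \Omega_{p_n}(\mathbf{Q}_n))\big]$,
we have $\frac{\partial G}{\partial \gamma_b} = 0 + \mathbb{E}^{(\Omega_c^*(\gamma), \Omega_p^*(\gamma))}[P_b(t) - P_b] = \dot{\gamma}_b(t)$. Therefore, we show that the ODE in (40) can
be expressed as $\dot{\gamma}(t) = \nabla G(\gamma(t))$. As a result, the ODE in (40) will converge to $\nabla G(\gamma) = 0$, which
corresponds to the per-BS average power constraints in (?).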

Appendix D: Proof of Lemma 4

Substituting the transition probability in (24) into (13) and then applying standard optimization techniques to
minimize the R.H.S. of (13), we can obtain the closed-form optimal power control policy for given CQSI
and CCSI: $p_{b,k}(\mathbf{H}_n, \mathbf{Q}_n) = \Big( \frac{\frac{\tau}{\bar{N}_{b,k}} \Delta_{b,k} V_n(\mathbf{Q}_n)}{\sum_{b' \in \omega_n} \gamma_{b'} \|\mathbf{w}_{(b,k),b'}\|^2} - (1 + I_{n,(b,k)}) \Big)^+$, where $\Delta_{b,k} V_n(\mathbf{Q}_n) = V_n(\mathbf{Q}_n) -
V_n([\mathbf{Q}_n - \mathbf{e}_{b,k}]^+)$. Similarly, substituting the transition probability in (27) into (25) and then applying standard
optimization techniques to minimize the R.H.S. of (25), we can obtain the closed-form optimal power
control policy for given LQSI and CCSI: $p_{b,k}(\mathbf{H}_n, Q) = \Big( \frac{\frac{\tau}{\bar{N}_{b,k}} \Delta V_{n,(b,k)}(Q)}{\sum_{b' \in \omega_n} \gamma_{b'} \|\mathbf{w}_{(b,k),b'}\|^2} - (1 + I_{n,(b,k)}) \Big)^+$, where
$\Delta V_{n,(b,k)}(Q) = V_{n,(b,k)}(Q) - V_{n,(b,k)}([Q - 1]^+)$.

The solution of the Bellman equation in (13) can be obtained by offline relative value iteration [?]. Without loss
of generality, we set $\mathbf{Q}_n^I = \mathbf{0}$ as the reference state. Hence, we have the normalizing equation $V_n^l(\mathbf{Q}_n^I) = 0$, $\forall l$.
Assume $V_n^l(\mathbf{Q}_n) = \sum_{b \in \omega_n} \sum_{k \in \mathcal{K}} V_{n,(b,k)}^l(Q_{b,k})$, $\forall l$.

At the (Z — l)-th iteration, updating policy by minimizing the R.H.S of (fT3l) is given by Qb,k) = 

(SS^^S^ - filki^n,Qb,k) = niogil +pl,)\iln,Qb,kWb,k, Where fti,(Hn,Qb,k) 

is the mean departure rate, and AF^ ^.^(Qb^fc) = K!,(fe,fc)(<36,fc) - ^n.{b,k)(iQb,k - 1]"*") is the potential 
increment for the MS (6, /c)'s queue. 

At the $l$-th iteration, we determine the potential $V_n^l(\mathbf{Q}_n)$ and $\theta_n^l$ by solving the normalizing equation
$V_n^l(\mathbf{Q}_n^I) = 0$ together with the $I_{Q_n} = (N_Q + 1)^{K|\omega_n|}$ fixed point equations in (13), which are given by

$\theta_n^l = \sum_{b \in \omega_n} \sum_{k \in \mathcal{K}} \theta_{n,(b,k)}^l$

$\theta_{n,(b,k)}^l = g_{n,(b,k)}(\gamma_n, Q_{b,k}, p_{b,k}^l(Q_{b,k})) + \lambda_{b,k}\tau \Delta V_{n,(b,k)}^l(\min\{Q_{b,k} + 1, N_Q\}) - \bar{\mu}_{b,k}^l(Q_{b,k})\tau \Delta V_{n,(b,k)}^l(Q_{b,k})$ (41)

where $g_{n,(b,k)}$ is obtained by interchanging the order of the double summation over $b$ and $b'$ of $g_n$
in (14) and decomposing it into the per-cluster per-user $g_{n,(b,k)}$, and $\bar{\mu}_{b,k}^l(Q_{b,k}) = \mathbb{E}[\mu_{b,k}^l(\mathbf{H}_n, Q_{b,k}) | Q_{b,k}]$. There
are $I_{Q_n}$ joint states $\mathbf{Q}_n = \{Q_{b,k}\}_{b \in \omega_n, k \in \mathcal{K}}$, but there are only $N_Q + 1$ states for each $Q_{b,k}$ ($b \in \omega_n, k \in \mathcal{K}$).
Hence, in the original $I_{Q_n}$ fixed-point equations (41), there are only $N_Q + 1$ independent fixed-point
equations for each MS $(b, k)$ in (41). In addition, set $V_{n,(b,k)}^l(0) = 0$, $\forall b, k$ as the individual
normalizing equation, which also satisfies $V_n^l(\mathbf{Q}_n^I) = \sum_{b \in \omega_n} \sum_{k} V_{n,(b,k)}^l(0) = 0$. Hence, in the $l$-th
iteration, we can obtain $\{V_{n,(b,k)}^l(Q_{b,k}), \theta_{n,(b,k)}^l\}$ by solving each MS's equivalent Bellman equation
in (41). Accordingly, $\{V_n^l(\mathbf{Q}_n), \theta_n^l\}$ is the solution for the original $I_{Q_n}$ fixed-point equations (41), where
$V_n^l(\mathbf{Q}_n) = \sum_{b \in \omega_n} \sum_{k} V_{n,(b,k)}^l(Q_{b,k})$ and $\theta_n^l = \sum_{b \in \omega_n} \sum_{k} \theta_{n,(b,k)}^l$.

Continue the iteration until the optimal policy converges. We obtain $\{V_{n,(b,k)}(Q_{b,k}), \theta_{n,(b,k)}\}$ and
$\{V_n(\mathbf{Q}_n), \theta_n\}$ as a solution of (25) and (13), respectively.
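For readers unfamiliar with relative value iteration, the following generic sketch solves an average-cost Bellman equation of the form $\theta + V(s) = \min_a \{g(a, s) + \sum_{s'} P_a(s, s') V(s')\}$ with the normalization $V(\text{ref}) = 0$. It works on a finite action set with explicit cost and kernel arrays, which is an assumption made for illustration; the proof above instead uses the closed-form power allocation inside each iteration.

```python
import numpy as np


def relative_value_iteration(costs, kernels, ref: int = 0, iters: int = 10000, tol: float = 1e-12):
    """Generic relative value iteration for an average-cost MDP.

    costs   : array of shape (A, S), per-stage cost g(a, s)
    kernels : array of shape (A, S, S), transition matrices P_a
    ref     : index of the reference state with V[ref] = 0
    Returns (theta, V, policy).
    """
    costs = np.asarray(costs, dtype=float)
    kernels = np.asarray(kernels, dtype=float)
    V = np.zeros(costs.shape[1])
    for _ in range(iters):
        TV = (costs + kernels @ V).min(axis=0)   # Bellman operator applied to V
        V_new = TV - TV[ref]                     # subtract the reference value to keep V bounded
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    q_values = costs + kernels @ V
    theta = q_values.min(axis=0)[ref] - V[ref]   # average cost per stage
    return theta, V, q_values.argmin(axis=0)


# Two-state toy chain with a single action: cost 1 is paid only in state 1.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
g = np.array([[0.0, 1.0]])
theta, V, policy = relative_value_iteration(g, P)
```

For this toy chain the stationary probability of state 1 is 1/3, so the average cost is 1/3 and the relative potential of state 1 is 10/3, which is what the iteration returns.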

Appendix E: Proof of Lemma 5

The inter-cell interference power $I_{n,(b,k)}$ is given as $I_{n,(b,k)} = \sum_{n' \ne n} \sum_{b' \in \omega_{n'}} \sum_{b'' \in \omega_{n'}, k'' \in \mathcal{K}} |\mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}|^2 p_{b'',k''}$.
Since the projection $(\cdot)^+$ is non-expansive [?], for any two power allocation vectors $\mathbf{p}(1), \mathbf{p}(2) \in \mathbb{R}_+^{BK}$,

$|WF_{n,(b,k)}(\mathbf{p}(1)) - WF_{n,(b,k)}(\mathbf{p}(2))| \le \sum_{n' \ne n} \sum_{b' \in \omega_{n'}} \sum_{b'' \in \omega_{n'}, k'' \in \mathcal{K}} |\mathbf{h}_{(b,k),b'} \mathbf{w}_{(b'',k''),b'}|^2 \cdot |p_{b'',k''}(1) - p_{b'',k''}(2)|, \ \forall b \in \omega_n, k \in \mathcal{K}$

Putting the above inequality in vector form, we have $\|\mathbf{WF}(\mathbf{p}(1)) - \mathbf{WF}(\mathbf{p}(2))\| \le \|\mathbf{S} \cdot (\mathbf{p}(1) - \mathbf{p}(2))\|$, with
the matrix $\mathbf{S}$ defined in (32). With the weighted maximum norms $\|\cdot\|_{\infty,vec}^{\mathbf{u}}$ and $\|\cdot\|_{\infty,mat}^{\mathbf{u}}$ defined above,
$\forall \mathbf{p}(1), \mathbf{p}(2) \in \mathbb{R}_+^{BK}$ and $\forall \mathbf{u} \in \mathbb{R}_+^{BK}$, we have $\|\mathbf{WF}(\mathbf{p}(1)) - \mathbf{WF}(\mathbf{p}(2))\|_{\infty,vec}^{\mathbf{u}} \le \|\mathbf{S} \cdot (\mathbf{p}(1) - \mathbf{p}(2))\|_{\infty,vec}^{\mathbf{u}} \le
\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} \|\mathbf{p}(1) - \mathbf{p}(2)\|_{\infty,vec}^{\mathbf{u}}$, so the mapping $\mathbf{WF}$ is a contraction if $\|\mathbf{S}\|_{\infty,mat}^{\mathbf{u}} < 1$ is satisfied.
By the theorem on convergence of contracting iterations [?], we can prove the existence and uniqueness
of the NE.

Fig. 1. Motivating example of the advantage of QSI-based clustering versus traditional CSI-based clustering in network MIMO
systems. For example, CSI-based clustering will always choose Clustering Pattern 1 (red). This will create an interference profile
in favor of MS2 and MS4 regardless of the queue states of MS2 and MS4. When the queues of MS2 and MS4 are empty, Clustering
Pattern 1 will no longer be a good choice. On the contrary, the QSI-based clustering method will choose between Clustering
Pattern 1 (red) and Clustering Pattern 2 (blue) based on the queue states of the mobiles. As a result, it can dynamically create
a favorable interference profile for selected mobiles based on their queue states.




(a) System Model   (b) Control Architecture 

Fig. 2. System model and control architecture of network MIMO systems. The dotted lines and solid lines in Fig. 2(b) 
denote the control path and data path, respectively. 



TABLE II 

Running time complexity of the baselines and the proposed scheme. The number of cells $B = 19$, the maximum cluster size 
$N_B = 3$, the number of MSs per BS $K = 1$, the number of antennas per BS $N_t = 4$, the average arrival rate $\lambda_{b,k} = 10$ 
pck/slot and the resolution level $d = 3$. 

Scheme                                      CPU Time (s) 
BL1: FCA                                    0.2218e-004 
BL2: Static Clustering (on CSI)             1.9098e-004 
BL3: Greedy Dynamic Clustering (on CSI)     0.0012 
Proposed Queue-aware Dynamic Clustering     0.0094 
[Flowchart of Fig. 3] 

Initialization: 
1. Set t = 0. 
2. Each cluster n initializes its per-cluster per-user conditional potentials. 
3. Each BS initializes its LM. 

Online policy improvement (at the beginning of the t-th slot): 
1. Clustering Pattern Selection: the BSC determines the clustering pattern based on GQSI. 
2. Precoding Vector Calculation: each CM determines the precoding vectors and the initial power control based on CCSI. 
3. Per-stage Interference Game: the CMs improve the power control based on the observed ICI. 

Online Learning Algorithm (at the end of the t-th slot): 
1. Potential Functions Update: the CM of each cluster n updates its potential functions. 
2. LMs Update: each BS updates its LM. 

Block labels: GQSI, Clustering Pattern Selection, CQSI, Precoder Calculation and Power Control, ICI. 

Fig. 3. Algorithm flow of the proposed online distributive primal-dual learning algorithm with per-stage QSI-aware 
interference game and simultaneous updates on potential functions and LMs. 




Fig. 4. Probability that the condition $\|\mathbf{S}\|^{\mathbf{u}}_{\infty,\mathrm{mat}} < 1$ is satisfied versus user location. The number of cells $B = 19$, the number of 
MSs per cell $K = 1$, the maximum cluster size $N_B = 3$, the number of transmit antennas $N_t = 4$. 
(a) Average delay per user versus transmit power (Tx Power, dBm) for $N_t = 4, 2$ antennas per BS and maximum cluster size $N_B = 3$. 
(b) Average delay per user versus transmit power (Tx Power, dBm) for maximum cluster size $N_B = 3, 4$ and $N_t = 4$ antennas per BS. 

Fig. 5. Average delay per user versus transmit power. The number of MSs per BS $K = 1$, the average arrival rate $\lambda_{b,k} = 10$ 
pck/slot and the resolution level $d = 3$. 




[Plot of Fig. 6; horizontal axis: Packet Arrival Rate (pkt/sec), ticks from 4 to 16] 

Fig. 6. Average delay per user versus per-user loading (average arrival rate $\lambda_{b,k}$) for $K = 1$ and 
$K = 2$ MSs per BS at the transmit power $P_t = 35$ dBm. The maximum cluster size $N_B = 3$, the number of transmit antennas at each BS 
$N_t = 4$, and the resolution level $d = 3$. 
Fig. 7. Convergence property of the proposed distributive stochastic learning algorithm via stochastic approximation. The 
transmit power $P_t = 35$ dBm, the maximum cluster size $N_B = 3$, the number of transmit antennas at each BS $N_t = 4$, the 
number of MSs per BS $K = 1$ and the resolution level $d = 3$. The figure illustrates instantaneous per-cluster potential function 
values versus instantaneous slot index. The boxes indicate the mean delay of various schemes at three selected slot indices. 